Batch Transcript Guide

Best practices for fetching transcripts at scale — concurrency tuning, caching, partial failures, and memory

When to use batch transcript

| Scenario                          | Use transcribeVideo() | Use transcribePlaylist() |
|-----------------------------------|-----------------------|--------------------------|
| 1–3 videos                        | Yes                   | No                       |
| Full playlist (any size)          | Manual loop           | Yes                      |
| Playlist with range (e.g. 10–30)  | Manual loop           | Yes (with from/to)       |
| Need progress tracking            | Manual                | Yes (built-in)           |

Concurrency tuning

The concurrency option controls how many transcripts are fetched in parallel. The default is 3.

Low concurrency (1–2)

Use when:

  • Running on a server with limited bandwidth
  • YouTube is rate-limiting your IP frequently
  • You need strict sequential processing
const result = await transcribePlaylist(playlistId, {
  apiKey,
  concurrency: 1,
})

Default concurrency (3)

Good for most use cases. Balances speed with YouTube's tolerance for parallel requests.

High concurrency (5–10)

Use when:

  • Fetching large playlists (100+ videos)
  • Running behind a proxy with IP rotation
  • Caching is enabled (most requests won't hit YouTube)
const result = await transcribePlaylist(playlistId, {
  apiKey,
  concurrency: 8,
  cache: new InMemoryCache(),
  retries: 2,
})

Concurrency and failures

Higher concurrency means more simultaneous connections to YouTube. If failures increase when you raise concurrency, that's a sign of rate limiting. Either lower the value or add retries.
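One way to encode that advice is a small back-off rule between runs: halve the concurrency when the failure rate spikes, and cautiously raise it after a clean run. The helper below is a hypothetical sketch (not part of the SDK), and the 20% threshold and cap of 10 are illustrative choices:

```typescript
// Hypothetical helper (not part of lyra-sdk): pick the concurrency for the
// next run based on the failure rate of the previous one.
function nextConcurrency(current: number, failureRate: number): number {
  if (failureRate > 0.2) {
    // Heavy failures usually mean rate limiting — halve, but never below 1.
    return Math.max(1, Math.floor(current / 2))
  }
  if (failureRate === 0 && current < 10) {
    // Clean run — raise by one, capped at 10.
    return current + 1
  }
  return current
}
```

You could feed it `result.failed / result.results.length` from the previous batch and pass the returned value as the next run's `concurrency`.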


Combining with cache

Batch transcription benefits heavily from caching. On the first run, all transcripts are fetched from YouTube. On subsequent runs with the same cache instance, only new or expired entries hit the network.

import { transcribePlaylist, InMemoryCache } from 'lyra-sdk/transcript'

const cache = new InMemoryCache()

// First run — fetches everything (N * 3 HTTP requests)
const result1 = await transcribePlaylist(playlistId, {
  apiKey,
  cache,
})

// Second run — all transcripts from cache (0 HTTP requests for transcripts)
const result2 = await transcribePlaylist(playlistId, {
  apiKey,
  cache,
})

The cache stores individual video transcripts, not the entire playlist result. Adding or removing videos from the playlist only affects cache misses for the changed videos.

FsCache for persistent caching

For CLI tools or long-running services, use FsCache to persist across restarts:

import { FsCache } from 'lyra-sdk/transcript'

const cache = new FsCache('./transcript-cache', 86400000) // 24-hour TTL

Handling partial failures

In a large enough playlist, some videos will fail to transcribe. Common reasons:

| Reason                 | Error message                            |
|------------------------|------------------------------------------|
| Video is private       | Video unavailable                        |
| Video is deleted       | Video unavailable                        |
| Captions disabled      | Transcripts are disabled for this video  |
| No captions available  | Could not find a transcript              |
| Language not available | Language "de" is not available           |
| Rate limited           | You are receiving rate limit errors      |
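When triaging a large batch, it helps to group failures by their error message before deciding what to retry. This sketch assumes each failed entry carries an `error` string with the message from the table above — the field name is an assumption, so check your SDK version's result shape:

```typescript
// Minimal sketch: group failed batch entries by error message.
// Assumption: failed entries expose an `error` string — verify against
// the actual result shape of your SDK version.
interface BatchEntry {
  videoId: string
  status: 'ok' | 'failed'
  error?: string
}

function groupFailures(results: BatchEntry[]): Map<string, string[]> {
  const groups = new Map<string, string[]>()
  for (const r of results) {
    if (r.status !== 'failed') continue
    const key = r.error ?? 'Unknown error'
    const ids = groups.get(key) ?? []
    ids.push(r.videoId)
    groups.set(key, ids)
  }
  return groups
}
```

Permanent failures (private or deleted videos, disabled captions) are not worth retrying; transient ones (rate limiting) are.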

Pattern: retry failed videos

const result = await transcribePlaylist(playlistId, { apiKey, retries: 2 })

if (result.failed > 0) {
  const failedIds = result.results
    .filter(r => r.status === 'failed')
    .map(r => r.videoId)

  console.log(`Retrying ${failedIds.length} failed videos...`)

  // Retry individually with longer delay
  for (const id of failedIds) {
    try {
      const lines = await transcribeVideo(id, { retries: 3, retryDelay: 2000 })
      console.log(`  ✓ ${id}`)
    } catch {
      console.log(`  ✗ ${id} — still failing`)
    }
  }
}

Pattern: log and continue

const result = await transcribePlaylist(playlistId, {
  apiKey,
  onProgress(done, total, videoId, status) {
    if (status === 'failed') {
      // Log to your error tracking service
      captureFailedTranscript(videoId, playlistId)
    }
  },
})

Memory considerations

Each TranscriptLine is a small object (~100 bytes). A typical 10-minute video produces 50–150 lines. For a 1000-video playlist:

  • Lines data: ~1000 × 100 lines × 100 bytes ≈ 10 MB
  • Video titles + metadata: ~100 KB
  • Total result object: ~10 MB
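The arithmetic above can be wrapped in a quick estimator if you want to sanity-check memory before starting a run (a sketch; the ~100 lines per video and ~100 bytes per line defaults are the rough averages stated above, not measured constants):

```typescript
// Back-of-the-envelope memory estimate for a batch result, using the
// rough averages from the text: ~100 lines per video, ~100 bytes per line.
function estimateResultBytes(
  videoCount: number,
  avgLinesPerVideo = 100,
  bytesPerLine = 100,
): number {
  return videoCount * avgLinesPerVideo * bytesPerLine
}
```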

This fits comfortably in memory for playlists up to several thousand videos. For very large playlists (10,000+), process in ranges:

const BATCH_SIZE = 500
const info = await client.playlistInfo(playlistId)
const totalVideos = info.videoCount

for (let from = 1; from <= totalVideos; from += BATCH_SIZE) {
  const to = Math.min(from + BATCH_SIZE - 1, totalVideos)
  const result = await transcribePlaylist(playlistId, { apiKey, from, to })

  // Process this batch, then allow GC to collect the result
  await processBatch(result)
}
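The range computation in that loop can be factored into a pure helper, which makes the batching logic easy to unit-test separately from the network calls (a sketch, not part of the SDK):

```typescript
// Compute the 1-based, inclusive from/to pairs used by the batching loop.
function batchRanges(
  totalVideos: number,
  batchSize: number,
): Array<{ from: number; to: number }> {
  const ranges: Array<{ from: number; to: number }> = []
  for (let from = 1; from <= totalVideos; from += batchSize) {
    ranges.push({ from, to: Math.min(from + batchSize - 1, totalVideos) })
  }
  return ranges
}
```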

Progress patterns

CLI progress bar

const result = await transcribePlaylist(playlistId, {
  apiKey,
  onProgress(done, total, videoId, status) {
    const pct = Math.round((done / total) * 100)
    const bar = '█'.repeat(Math.floor(pct / 2)) + '░'.repeat(50 - Math.floor(pct / 2))
    process.stdout.write(`\r  [${bar}] ${pct}% (${done}/${total})`)
  },
})
console.log('\n')
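The bar-building expression inside `onProgress` can be pulled out into a pure function, which keeps the callback short and makes the rendering testable (a sketch; `width` defaults to the 50-character bar used above):

```typescript
// Render a textual progress bar like the one built inline above.
function renderBar(done: number, total: number, width = 50): string {
  const pct = Math.round((done / total) * 100)
  const filled = Math.floor((pct / 100) * width)
  return '█'.repeat(filled) + '░'.repeat(width - filled) + ` ${pct}% (${done}/${total})`
}
```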

WebSocket progress

// Server-side: broadcast progress to connected clients
wss.on('connection', (ws) => {
  ws.on('message', async (msg) => {
    const { playlistId } = JSON.parse(msg.toString())

    const result = await transcribePlaylist(playlistId, {
      apiKey,
      onProgress(done, total, videoId, status) {
        ws.send(JSON.stringify({ type: 'progress', done, total, videoId, status }))
      },
    })

    ws.send(JSON.stringify({ type: 'complete', result }))
  })
})
