High-accuracy audio/video transcription and subtitle generation powered by ElevenLabs Scribe.
Elevate wraps the ElevenLabs Speech-to-Text API into a battle-tested CLI pipeline that handles everything from 30-second trailers to 3-hour movies. Drop in a file or a YouTube URL, get production-ready subtitles.
- State-of-the-art accuracy: ElevenLabs Scribe v2 delivers the lowest word error rate across 90+ languages, outperforming Whisper, Deepgram, and AssemblyAI on most benchmarks.
- CJK-aware subtitle pipeline: purpose-built for Chinese, Japanese, and Korean. Sentence splitting respects CJK punctuation, line breaking uses character-width logic, and reading speed targets are tuned per script (CJK CPS vs Latin CPS).
- Speaker diarization: up to 32 speakers, automatically labeled in the transcript.
- Audio event tagging: [laughter], [applause], [music] and other non-speech sounds are captured with accurate timestamps.
- URL transcription: transcribe YouTube, TikTok, or any hosted video/audio URL directly. ElevenLabs downloads the media server-side; nothing is saved locally.
- Chunked processing: long files are automatically split, transcribed in parallel, and merged back with correct timestamps. Crash recovery via state files means you never re-upload a completed chunk.
- API key rotation: add multiple ElevenLabs keys, each tracked with per-key usage stats. When one key hits its quota, the next one picks up automatically.
- SOCKS5 proxy: native SOCKS5 support for regions where ElevenLabs is not directly reachable.
- FFmpeg progress: real-time percentage display during audio extraction from video files.
- Intelligent duration clamping: word-level timestamp correction prevents subtitles from displaying too long (a common STT artifact), reducing >7s subtitle occurrences by ~50%.
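
The duration clamping in the last bullet can be pictured with a short Go sketch. It is not Elevate's actual code: the `Word` type, the `clampWord` function, and the rule itself (cap a word at `clampFactor` times the time its characters would take at the target reading speed, and never above `maxWordDuration`) are assumptions made for illustration. The numbers passed in `main` mirror the `latin_cps`, `clamp_factor`, and `max_word_duration` defaults from the config.

```go
package main

import "fmt"

// Word is a hypothetical word-level timestamp, in seconds.
type Word struct {
	Text       string
	Start, End float64
}

// clampWord shortens a word whose reported duration is implausibly long.
// Assumed rule: expected duration = rune count / cps; the cap is
// clampFactor * expected, and never more than maxWordDuration.
func clampWord(w Word, cps, clampFactor, maxWordDuration float64) Word {
	expected := float64(len([]rune(w.Text))) / cps
	limit := clampFactor * expected
	if limit > maxWordDuration {
		limit = maxWordDuration
	}
	if w.End-w.Start > limit {
		w.End = w.Start + limit
	}
	return w
}

func main() {
	// A 5-letter word that the STT output stretched across 6 seconds.
	w := Word{Text: "hello", Start: 10.0, End: 16.0}
	c := clampWord(w, 21.0, 2.5, 3.0)
	fmt.Printf("%.3f -> %.3f\n", c.Start, c.End)
}
```

Counting runes rather than bytes keeps the estimate meaningful for CJK text, where one character occupies several bytes.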
- Go 1.21+ (to build from source)
- FFmpeg (for video files)
- An ElevenLabs API key: sign up free (4.5 hours STT/month, no credit card)
git clone <repo-url> && cd elevate
go build -o elevate .
./elevate keys add sk-your-elevenlabs-key-here

You can add multiple keys for automatic rotation:
./elevate keys add sk-key-one
./elevate keys add sk-key-two
./elevate keys import keys.txt # one key per line

# Local video file (auto-extracts audio, splits if >8min, generates SRT)
./elevate transcribe movie.mkv
# YouTube URL (zero download: ElevenLabs fetches it server-side)
./elevate transcribe --url "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
# Batch process a directory
./elevate batch /path/to/videos/
# Specify language for better accuracy
./elevate transcribe --language zh movie_mandarin.mkv
# Choose a specific audio stream (0-based)
./elevate transcribe --stream 1 movie_with_multiple_audio.mkv

For movie.mkv, Elevate produces:
| File | Content |
|---|---|
| movie.srt | Production-ready subtitles |
| movie.transcript.json | Raw API response with word-level timestamps |
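
For readers unfamiliar with the SRT format: each cue consists of a running index, a time range written as `HH:MM:SS,mmm --> HH:MM:SS,mmm`, the subtitle text, and a blank line. The Go sketch below formats one cue. `srtTime` and `srtCue` are hypothetical helper names chosen for this example, not Elevate's actual functions.

```go
package main

import (
	"fmt"
	"math"
)

// srtTime renders a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
func srtTime(seconds float64) string {
	ms := int64(math.Round(seconds * 1000))
	h := ms / 3600000
	m := ms % 3600000 / 60000
	s := ms % 60000 / 1000
	return fmt.Sprintf("%02d:%02d:%02d,%03d", h, m, s, ms%1000)
}

// srtCue renders one numbered cue followed by the blank separator line.
func srtCue(index int, start, end float64, text string) string {
	return fmt.Sprintf("%d\n%s --> %s\n%s\n\n", index, srtTime(start), srtTime(end), text)
}

func main() {
	fmt.Print(srtCue(1, 3723.456, 3725.0, "Hello there."))
}
```

Note that SRT uses a comma, not a period, before the milliseconds.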
On first run, Elevate creates ~/.config/elevate/config.toml with sensible defaults:
[api]
model = "scribe_v2"
language = "" # empty = auto-detect
diarize = true
tag_audio_events = true
timestamps_granularity = "word"
[proxy]
url = "" # e.g. "socks5://127.0.0.1:2080"
[subtitle]
min_duration = 0.8
max_duration = 7.0
cjk_cps = 9.0 # characters per second (CJK)
latin_cps = 21.0 # characters per second (Latin)
cjk_chars_per_line = 18
latin_chars_per_line = 42
clamp_factor = 2.5 # word duration clamping multiplier
max_word_duration = 3.0 # absolute max word duration (seconds)
[processing]
split_threshold_min = 8 # split files longer than N minutes
max_concurrent_uploads = 4
max_retries = 3
[output]
save_transcript_json = true

elevate keys list # show all keys with usage stats
elevate keys add <key> # add and verify a key
elevate keys remove <key> # remove a key
elevate keys import <file> # bulk import from file

Keys are stored in ~/.config/elevate/keys.json with per-key usage tracking (request count, total audio seconds, last used timestamp). Keys rotate automatically: when one hits its quota, the next active key takes over.
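
The rotation behavior can be sketched in a few lines of Go. This is an illustration under assumptions, not Elevate's implementation: the `Key` and `Ring` types, the `Exhausted` flag, and the method names are hypothetical, and the real schema of keys.json is not documented here.

```go
package main

import (
	"errors"
	"fmt"
)

// Key is a hypothetical record; the real keys.json schema may differ.
type Key struct {
	Secret       string
	Requests     int
	AudioSeconds float64
	Exhausted    bool // set once the API reports the quota is used up
}

// Ring hands out the current key and skips exhausted ones.
type Ring struct {
	keys []Key
	cur  int
}

// Current returns the first non-exhausted key at or after the cursor.
func (r *Ring) Current() (*Key, error) {
	for i := 0; i < len(r.keys); i++ {
		idx := (r.cur + i) % len(r.keys)
		if !r.keys[idx].Exhausted {
			r.cur = idx
			return &r.keys[idx], nil
		}
	}
	return nil, errors.New("all keys exhausted")
}

// MarkExhausted flags the current key so the next call moves on.
func (r *Ring) MarkExhausted() {
	if len(r.keys) > 0 {
		r.keys[r.cur].Exhausted = true
	}
}

func main() {
	ring := &Ring{keys: []Key{{Secret: "sk-key-one"}, {Secret: "sk-key-two"}}}
	k, _ := ring.Current()
	fmt.Println("using", k.Secret)
	ring.MarkExhausted() // e.g. after a quota error from the API
	k, _ = ring.Current()
	fmt.Println("using", k.Secret)
}
```

Returning an error once every key is exhausted lets the caller stop cleanly instead of retrying against keys that cannot succeed.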
cmd/            CLI commands (cobra)
internal/
  api/          ElevenLabs HTTP client, retry logic, error classification
  config/       TOML config with auto-creation
  engine/       Orchestrator: probe → extract → split → upload → merge → generate
  keys/         Multi-key manager with round-robin rotation and usage tracking
  media/        FFmpeg wrapper: probe, extract, split, transcode, progress
  proxy/        SOCKS5 dialer integration
  subtitle/     Pipeline: word splitting → duration clamping → sentence merging → SRT
  util/         CJK detection, time formatting
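
The merge step of the engine has to shift each chunk's timestamps back onto the timeline of the original file. Below is a minimal Go sketch of that idea, assuming each chunk records the offset at which it was cut. `Word`, `Chunk`, and `mergeChunks` are illustrative names, not Elevate's internal types.

```go
package main

import (
	"fmt"
	"sort"
)

// Word is a hypothetical word-level timestamp, in seconds.
type Word struct {
	Text       string
	Start, End float64
}

// Chunk holds one transcribed segment of a longer file.
type Chunk struct {
	Offset float64 // where this chunk starts in the original file
	Words  []Word  // timestamps relative to the chunk start
}

// mergeChunks returns all words on the original file's timeline.
func mergeChunks(chunks []Chunk) []Word {
	// Parallel uploads can finish in any order, so sort a copy by offset.
	sorted := append([]Chunk(nil), chunks...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Offset < sorted[j].Offset })

	var merged []Word
	for _, c := range sorted {
		for _, w := range c.Words {
			merged = append(merged, Word{w.Text, w.Start + c.Offset, w.End + c.Offset})
		}
	}
	return merged
}

func main() {
	chunks := []Chunk{
		{Offset: 480, Words: []Word{{"world", 0.5, 1.0}}}, // second chunk arrived first
		{Offset: 0, Words: []Word{{"hello", 1.0, 1.4}}},
	}
	for _, w := range mergeChunks(chunks) {
		fmt.Printf("%s %.1f-%.1f\n", w.Text, w.Start, w.End)
	}
}
```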
| Component | Technology |
|---|---|
| Language | Go |
| STT API | ElevenLabs Scribe v1/v2 |
| CLI | Cobra |
| Config | TOML |
| Media | FFmpeg/ffprobe |
| Proxy | golang.org/x/net/proxy (SOCKS5) |
- ElevenLabs merged token bug: Scribe occasionally merges sentence-ending punctuation with the next word (e.g., a CJK punctuation mark fused onto the name "Harry"). Affects ~10 tokens per 2-hour film, primarily with English names in CJK speech. Tracked upstream at elevenlabs-python#607.
- Non-deterministic results: the STT model may return slightly different transcripts for the same audio across API calls. Support for the seed parameter is planned to improve reproducibility.
- URL mode skips chunking: --url sends the full URL to ElevenLabs; local chunking does not apply. Files up to 10 hours / 3 GB are supported by the API.
GPL3