Composable Supertone TTS toolkit: synthesis, voice search/preview/clone, usage — 31 languages
Supertone TTS · v0.3.1
by Supertone-inc
supertone-mcp
A composable MCP toolkit for the Supertone TTS API. Rather than a single "speak this text" command, it exposes Supertone's SDK as a set of building-block tools — synthesis, voice discovery, preview, duration/credit prediction, usage tracking, and full voice-cloning CRUD — that an LLM assembles to fulfill a request. Works in Claude Desktop, Cursor, or any MCP-compatible client.
Covers 31 languages, including Korean, English, and Japanese. Per-call controls include speed (0.5x–2.0x), pitch shift (-24 to +24 semitones), emotion styles, output mode, streaming, and model selection.
Features
Synthesis
- text_to_speech: Convert text to audio. Per-call control of output_mode (files / resources / both), autoplay, streaming, model, plus include_phonemes / normalized_text. Long text is auto-chunked by the SDK.
- predict_duration: Estimate audio length (and credit cost) without synthesizing.
Voice discovery (preset)
- search_voice: Filter the catalog by language, gender, age, use_case, style, model, name, or description.
- get_voice: Full detail for one voice.
- preview_voice: Sample audio URLs for a voice (filterable by language/style/model).
Custom voice cloning
- clone_voice: Create a cloned voice from a local WAV/MP3 (≤3MB).
- search_custom_voice: List/filter cloned voices.
- get_custom_voice: Full detail for one cloned voice.
- edit_custom_voice: Update name and/or description.
- delete_custom_voice: Permanently delete (irreversible).
Audio assembly
- merge_audio_files: Concatenate two or more local audio files (mp3/wav) into one via a bundled ffmpeg. Supports plain concat, silence gaps between clips (gap_ms), or crossfade blending (crossfade_ms). Output format is auto-detected (mixed → mp3) or forced via output_format. No system ffmpeg required.
Usage & credits
- get_credit_balance: Remaining credits.
- get_usage_history: Usage over a time window.
- get_voice_usage: Usage for a specific voice.
Breaking changes & migration (0.2.0)
0.2.0 moves behavior control out of environment variables and into per-call tool parameters — so the LLM decides per request, not the server config.
| Before (env var) | After (per-call parameter) | Note |
|---|---|---|
| SUPERTONE_MCP_OUTPUT_MODE=files/resources/both | text_to_speech(output_mode=...) | Default still files |
| SUPERTONE_MCP_AUTOPLAY=true | text_to_speech(autoplay=...) | Default changed true → false (playback is now explicit) |
| (always streamed) | text_to_speech(streaming=...) | New, default false (one-shot). streaming=true requires model="sona_speech_1" |
Other changes:
- Default model changed: sona_speech_1 → sona_speech_2_flash.
- list_voices was removed (since the discovery release) and replaced by search_voice. Call it with no arguments to reproduce the old "list everything" behavior.
- No more hard 300-character limit. Longer text is auto-chunked by the SDK (credit/latency scale with length).
If you previously set SUPERTONE_MCP_OUTPUT_MODE or SUPERTONE_MCP_AUTOPLAY, remove them from your client config and pass output_mode / autoplay per call instead. (The server prints a one-time stderr notice if it sees the removed vars.)
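The cleanup described above can be scripted. The following is a minimal sketch, assuming the client config is a JSON object with the mcpServers shape shown under Configuration. The function name strip_removed_vars is hypothetical and is not part of supertone-mcp.

```python
import copy

# Environment variables removed in 0.2.0, per the migration table.
REMOVED_VARS = ("SUPERTONE_MCP_OUTPUT_MODE", "SUPERTONE_MCP_AUTOPLAY")


def strip_removed_vars(config: dict) -> tuple[dict, list[str]]:
    """Return a copy of a client config without the removed variables.

    Also returns "server:VARIABLE" labels for every entry that was dropped,
    so the caller can report what changed. The input dict is not modified.
    """
    cleaned = copy.deepcopy(config)
    dropped = []
    for server_name, server in cleaned.get("mcpServers", {}).items():
        env = server.get("env", {})
        for var in REMOVED_VARS:
            if var in env:
                del env[var]
                dropped.append(f"{server_name}:{var}")
    return cleaned, dropped
```

Load the config with json.load, pass it through the function, and write the cleaned copy back only if the dropped list is non-empty. The values that were removed then need to be supplied as output_mode / autoplay arguments on each text_to_speech call.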
Installation
# Using uvx (recommended)
uvx supertone-mcp
# Using pip
pip install supertone-mcp
Configuration
Claude Desktop
Add to claude_desktop_config.json:
{
  "mcpServers": {
    "supertone-tts": {
      "command": "uvx",
      "args": ["supertone-mcp"],
      "env": {
        "SUPERTONE_API_KEY": "your-api-key-here"
      }
    }
  }
}
Cursor
Add to your Cursor MCP settings (same JSON shape as above).
Environment Variables
Only authentication and stable defaults are configured via the environment — all behavior is controlled per call.
| Variable | Required | Default | Description |
|---|---|---|---|
| SUPERTONE_API_KEY | Yes | - | Your Supertone API key |
| SUPERTONE_MCP_VOICE_ID | No | preset voice (Aiden, multilingual) | Default voice_id for text_to_speech / predict_duration (override per call) |
| SUPERTONE_OUTPUT_DIR | No | ~/supertone-tts-output/ | Directory where audio files are saved (used by output_mode=files/both) |
Removed in 0.2.0: SUPERTONE_MCP_OUTPUT_MODE and SUPERTONE_MCP_AUTOPLAY (see Breaking changes & migration).
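To make the three variables and their defaults concrete, here is a minimal sketch of how a server could read them. It is an illustration under stated assumptions, not the actual supertone-mcp code: the Settings class and load_settings function are hypothetical, and a missing SUPERTONE_MCP_VOICE_ID is represented as None because the preset voice's identifier is not given here.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_OUTPUT_DIR = "~/supertone-tts-output/"  # documented default


@dataclass(frozen=True)
class Settings:
    api_key: str
    voice_id: Optional[str]  # None means "fall back to the preset voice"
    output_dir: Path


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the documented variables and apply the documented defaults."""
    api_key = env.get("SUPERTONE_API_KEY", "").strip()
    if not api_key:
        raise ValueError("SUPERTONE_API_KEY is required")
    voice_id = env.get("SUPERTONE_MCP_VOICE_ID") or None
    raw_dir = env.get("SUPERTONE_OUTPUT_DIR") or DEFAULT_OUTPUT_DIR
    return Settings(api_key, voice_id, Path(raw_dir).expanduser())
```

Passing the environment in as a mapping keeps the function easy to exercise with a plain dict instead of the real process environment.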
Output modes (text_to_speech output_mode)
| Mode | Returns | Use when |
|---|---|---|
| files (default) | Plain text with the saved file path + metadata | You want the file on disk |
| resources | MCP AudioContent + TextContent (no file written) | The client renders audio inline (e.g., Claude.ai chat) |
| both | File on disk and AudioContent/TextContent | You want both: preview inline, keep the file |
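The three modes reduce to two independent decisions: whether a file is written, and whether audio content is returned to the client. A small sketch of that mapping follows. OutputPlan and plan_output are hypothetical names used only for illustration, and the server's real dispatch may look different.

```python
from typing import NamedTuple


class OutputPlan(NamedTuple):
    write_file: bool    # save the audio under the output directory
    return_audio: bool  # return MCP AudioContent + TextContent


# One entry per documented output_mode value.
_PLANS = {
    "files": OutputPlan(write_file=True, return_audio=False),
    "resources": OutputPlan(write_file=False, return_audio=True),
    "both": OutputPlan(write_file=True, return_audio=True),
}


def plan_output(output_mode: str = "files") -> OutputPlan:
    """Translate an output_mode value into the two underlying decisions."""
    try:
        return _PLANS[output_mode]
    except KeyError:
        allowed = ", ".join(_PLANS)
        raise ValueError(f"output_mode must be one of: {allowed}") from None
```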
Usage Examples
The MCP client routes natural-language requests across these tools — the value of the toolkit is composition: the LLM chains several tools to satisfy one request.
Example 1 — Discover → preview → estimate cost → synthesize
"Find a calm Korean female voice, let me hear a sample, check the cost, then make this announcement as an mp3."
The LLM assembles:
search_voice(language="ko", gender="female", style="neutral") # find candidates
→ preview_voice(voice_id) # sample URLs to confirm the voice
→ predict_duration(text, voice_id) + get_credit_balance() # gauge cost before spending
→ text_to_speech(text, voice_id, output_format="mp3", output_mode="files") # synthesize
Example 2 — Clone my voice → use it right away
"Make a cloned voice from ~/recordings/sample.wav named MyVoice, then read this greeting with it and play it for me."
The LLM assembles:
clone_voice(name="MyVoice", audio_path="~/recordings/sample.wav") # create the cloned voice
→ get_custom_voice(voice_id) # confirm it was created
→ text_to_speech(text, voice_id=<cloned>, autoplay=true) # synthesize, then play immediately
autoplay is a per-call parameter (default false), so playback happens only when explicitly requested.
Tool Parameters
text_to_speech
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | - | Text to convert (long text is auto-chunked by the SDK) |
| voice_id | string | No | env or preset | Voice identifier (browse via search_voice) |
| language | string | No | ko | Language code, one of 31 (ko, en, ja, …) |
| output_format | string | No | mp3 | mp3 or wav |
| model | string | No | sona_speech_2_flash | sona_speech_1, sona_speech_2, sona_speech_2_flash, sona_speech_2t, sona_speech_3t, supertonic_api_1, supertonic_api_3 |
| speed | float | No | 1.0 | 0.5–2.0 |
| pitch_shift | int | No | 0 | -24 to +24 semitones |
| style | string | No | - | Emotion style (varies by voice) |
| output_mode | string | No | files | files, resources, or both (see Output modes) |
| autoplay | bool | No | false | Play the audio locally after synthesis (macOS afplay) |
| streaming | bool | No | false | Stream synthesis. Only supported by model="sona_speech_1" |
| include_phonemes | bool | No | false | Return phoneme timing data alongside the audio |
| normalized_text | string | No | - | Pre-normalized text (only used by sona_speech_2 / sona_speech_2_flash) |
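The constraints in this table can be checked before a call is made. The sketch below is a client-side illustration only: validate_tts_args is a hypothetical helper, and how the server itself reports invalid arguments is not described here. One interaction it makes visible is that streaming=true conflicts with the default model, so a streaming call has to set model="sona_speech_1" explicitly.

```python
# Values taken from the text_to_speech parameter table.
MODELS = frozenset({
    "sona_speech_1", "sona_speech_2", "sona_speech_2_flash", "sona_speech_2t",
    "sona_speech_3t", "supertonic_api_1", "supertonic_api_3",
})
STREAMING_MODEL = "sona_speech_1"


def validate_tts_args(
    *,
    speed: float = 1.0,
    pitch_shift: int = 0,
    output_format: str = "mp3",
    output_mode: str = "files",
    model: str = "sona_speech_2_flash",
    streaming: bool = False,
) -> list[str]:
    """Return human-readable problems; an empty list means the args look valid."""
    problems = []
    if not 0.5 <= speed <= 2.0:
        problems.append(f"speed {speed} is outside 0.5-2.0")
    if not -24 <= pitch_shift <= 24:
        problems.append(f"pitch_shift {pitch_shift} is outside -24..+24")
    if output_format not in ("mp3", "wav"):
        problems.append(f"output_format {output_format!r} is not mp3 or wav")
    if output_mode not in ("files", "resources", "both"):
        problems.append(f"output_mode {output_mode!r} is not recognized")
    if model not in MODELS:
        problems.append(f"model {model!r} is not recognized")
    if streaming and model != STREAMING_MODEL:
        problems.append(f"streaming requires model={STREAMING_MODEL!r}")
    return problems
```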
predict_duration
Same core parameter schema as text_to_speech (long text auto-chunked). Returns a line such as "Predicted duration: 2.34s (credit usage is proportional to duration)."
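Because the result comes back as text, a caller that wants the number has to extract it. A minimal sketch, assuming the response keeps the "Predicted duration: <seconds>s" wording quoted above; parse_predicted_duration is a hypothetical helper, and the pattern should be adjusted if the wording differs.

```python
import re
from typing import Optional

# Matches the seconds value in "Predicted duration: 2.34s ...".
_DURATION_RE = re.compile(r"Predicted duration:\s*([0-9]+(?:\.[0-9]+)?)s")


def parse_predicted_duration(message: str) -> Optional[float]:
    """Return the predicted duration in seconds, or None if it is not found."""
    match = _DURATION_RE.search(message)
    return float(match.group(1)) if match else None
```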
search_voice
All parameters are optional. With no filters, the full catalog is returned. With any filter, the first response line is Filters applied: ...
| Parameter | Type | Description |
|---|---|---|
| language | string | e.g., ko, en, ja |
| gender | string | e.g., male, female |
| age | string | e.g., young_adult, child |
| use_case | string | e.g., narration, advertisement |
| style | string | e.g., neutral, happy |
| model | string | e.g., sona_speech_2_flash |
| name | string | partial match |
| description | string | partial match |
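Since every filter is optional, a caller typically sends only the ones that are set. The following sketch shows one way to build that argument dict and to check for the "Filters applied:" marker in the response. Both function names are hypothetical, and only the marker prefix is relied on, because the rest of that line's format is not specified here.

```python
from typing import Optional

FILTER_NAMES = (
    "language", "gender", "age", "use_case",
    "style", "model", "name", "description",
)


def build_search_args(**filters: Optional[str]) -> dict[str, str]:
    """Keep only the filters that were actually set; reject unknown names."""
    unknown = set(filters) - set(FILTER_NAMES)
    if unknown:
        raise TypeError(f"unknown filter(s): {', '.join(sorted(unknown))}")
    return {key: value for key, value in filters.items() if value}


def filters_were_applied(response_text: str) -> bool:
    """True when the first response line starts with the documented marker."""
    lines = response_text.splitlines()
    first_line = lines[0] if lines else ""
    return first_line.startswith("Filters applied:")
```

An empty dict from build_search_args() corresponds to the "full catalog" call that replaces the old list_voices behavior.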
get_voice / preview_voice
| Tool | Required | Optional |
|---|---|---|
| get_voice | voice_id | - |
| preview_voice | voice_id | language, style, model (filter samples) |
clone_voice
| Parameter | Type | Required | Description |
|---|---|---|---|
| name | string | Yes | Display name (non-empty) |
| audio_path | string | Yes | Local WAV or MP3 path (≤3MB). Supports ~ expansion |
| description | string | No | Optional note |
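The constraints above (non-empty name, WAV or MP3, at most 3MB, ~ expansion) can be checked locally before any upload is attempted. This is a sketch under stated assumptions: check_clone_inputs is a hypothetical helper, the file type is judged by extension only, and "3MB" is read as 3 × 1024 × 1024 bytes, which may not match the exact limit the API enforces.

```python
from pathlib import Path

MAX_BYTES = 3 * 1024 * 1024  # assumption: "3MB" interpreted as 3 MiB
ALLOWED_SUFFIXES = {".wav", ".mp3"}


def check_clone_inputs(name: str, audio_path: str) -> Path:
    """Validate clone_voice inputs and return the expanded audio path."""
    if not name.strip():
        raise ValueError("name must be non-empty")
    path = Path(audio_path).expanduser()  # handles a leading ~
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"{path.name}: expected a .wav or .mp3 file")
    if not path.is_file():
        raise FileNotFoundError(str(path))
    size = path.stat().st_size
    if size > MAX_BYTES:
        raise ValueError(f"{path.name}: {size} bytes exceeds the 3MB limit")
    return path
```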
Custom voice CRUD
| Tool | Required | Optional |
|---|---|---|
| search_custom_voice | - | name, description (partial match) |
| get_custom_voice | voice_id | - |
| edit_custom_voice | voice_id | name, description (at least one required) |
| delete_custom_voice | voice_id | - (IRREVERSIBLE) |
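The "at least one required" rule for edit_custom_voice is easy to miss, so here is a small sketch of building the arguments with that rule enforced up front. build_edit_args is a hypothetical helper, not part of the toolkit.

```python
from typing import Optional


def build_edit_args(
    voice_id: str,
    name: Optional[str] = None,
    description: Optional[str] = None,
) -> dict[str, str]:
    """Build edit_custom_voice arguments, requiring name and/or description."""
    if not voice_id:
        raise ValueError("voice_id is required")
    if name is None and description is None:
        raise ValueError("provide name, description, or both")
    args = {"voice_id": voice_id}
    if name is not None:
        args["name"] = name
    if description is not None:
        args["description"] = description
    return args
```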
Usage & credits
| Tool | Required | Optional |
|---|---|---|
| get_credit_balance | - | - |
| get_usage_history | - | - (reports a recent default window) |
| get_voice_usage | voice_id | - |
merge_audio_files
| Parameter | Type | Required | Description |
|---|---|---|---|
| input_paths | string[] | Yes | Two or more local mp3/wav paths (~ expansion supported). A single file is returned as-is |
| gap_ms | int | No | Silence (ms) inserted at each junction. Default 0. Mutually exclusive with crossfade_ms |
| crossfade_ms | int | No | Crossfade blend (ms) at each junction. Default 0. Mutually exclusive with gap_ms |
| output_format | string | No | Force mp3 or wav. If omitted: all-same-ext → that ext; mixed → mp3 |
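The format rule and the gap/crossfade exclusivity can be expressed in a few lines. The sketch below illustrates the documented rules only and does not call ffmpeg. resolve_output_format is a hypothetical name, and treating "both values greater than 0" as the conflicting case is an assumption about how the exclusivity is enforced.

```python
from pathlib import Path
from typing import Optional, Sequence

SUPPORTED = {"mp3", "wav"}


def resolve_output_format(
    input_paths: Sequence[str],
    gap_ms: int = 0,
    crossfade_ms: int = 0,
    output_format: Optional[str] = None,
) -> str:
    """Return the extension the merged file would get under the documented rules."""
    if not input_paths:
        raise ValueError("input_paths must not be empty")
    if gap_ms > 0 and crossfade_ms > 0:
        raise ValueError("gap_ms and crossfade_ms are mutually exclusive")
    if output_format is not None:
        if output_format not in SUPPORTED:
            raise ValueError("output_format must be mp3 or wav")
        return output_format  # an explicit choice always wins
    extensions = {Path(p).suffix.lower().lstrip(".") for p in input_paths}
    if not extensions <= SUPPORTED:
        raise ValueError("only mp3 and wav inputs are supported")
    # All inputs share one extension: keep it. Mixed inputs: fall back to mp3.
    return extensions.pop() if len(extensions) == 1 else "mp3"
```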
Development
# Clone and install
git clone https://github.com/supertone-inc/supertone-mcp.git
cd supertone-mcp
uv sync
# Run tests
uv run pytest -q
# Run with coverage
uv run pytest --cov=src --cov-report=term-missing
License
MIT