StepFun
Models
-
Step Image Edit 2 is StepFun’s lightweight unified image generation and editing model. It supports both text-to-image generation and image editing through API endpoints, has fewer than 6B parameters, and is designed for 1- to 2-second responses in generation and editing workflows.
New | Multimodal | Released 1mo ago
-
StepAudio 2.5 TTS is StepFun’s contextual text-to-speech model with performance-oriented vocal control. It combines global and inline context guidance with zero-shot voice cloning so generated speech can follow broader style instructions as well as local delivery details, rather than just reading text flatly.
New | Multimodal | Released 1mo ago
-
ACE-Step 1.5 XL is the 4B-parameter Diffusion Transformer decoder in the ACE-Step 1.5 music generation line, built for higher audio quality than the earlier 2B models. The release says it achieves the best scores across all 11 benchmark metrics, surpassing both commercial and open-source models, while remaining compatible with ACE-Step language models from 0.6B to 4B.
New | Multimodal | Released 2mo ago
-
Step 3.5 Flash is StepFun’s open-source sparse MoE LLM, with 196B total parameters but only 11B active per token. It is tuned for fast agentic reasoning, coding, and long-context work while staying efficient.
Text | Released 4mo ago
-
ACE-STEP v1.5 is an open-source, fast music foundation model that uses a hybrid language model plus diffusion transformer pipeline to turn short prompts into multi-minute songs. It runs on consumer GPUs with under 4 GB of VRAM.
Audio | Released 4mo ago
-
Step-Audio-EditX is a text-guided audio editor for speech, music, and effects. Describe the change and it performs precise, time-aligned edits with low latency.
Audio | Released 7mo ago
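
To put the Step 3.5 Flash figures above in perspective, the short sketch below works out what share of the model's parameters is active for each token. It uses only the two numbers quoted in that entry (196B total, 11B active); the function and constant names are illustrative and do not come from StepFun.

```python
# Figures quoted in the Step 3.5 Flash entry (in billions of parameters).
TOTAL_PARAMS_B = 196
ACTIVE_PARAMS_B = 11


def active_fraction(active_b: float, total_b: float) -> float:
    """Return the share of parameters used per token in a sparse MoE model."""
    if total_b <= 0:
        raise ValueError("total parameter count must be positive")
    return active_b / total_b


fraction = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
print(f"Active per token: {fraction:.1%}")  # prints "Active per token: 5.6%"
```

Roughly one parameter in eighteen takes part in any single token's forward pass, which is the sense in which the listing describes the model as sparse.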
