Long-duration video generation with identity-consistent AI
LongCat Avatar is an audio-driven avatar model for long-duration video generation. It focuses on identity consistency, precise lip synchronization, and natural human dynamics, including gestures and idle movements during silent segments. The system is designed to maintain visual quality across extended or theoretically infinite-length sequences.
Built on the LongCat-Video architecture, the model supports multiple generation modes and efficient inference suitable for production workflows. It is relevant for creators, studios, education teams, research groups, and SaaS providers that require consistent, realistic avatar videos at scale.
The workflow begins with an audio input (speech, music, or podcast) and optional references (image or text). Users select a generation mode, either AT2V (audio-text-to-video), ATI2V (audio-text-image-to-video), or audio-conditioned video continuation, and then configure length, resolution, and whether multi-person generation is needed. The system is optimized for long-form stability and identity consistency across extended sequences.
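The choices in this workflow can be captured in a small request object. The sketch below is illustrative only: `AvatarRequest`, its field names, and its validation rules are hypothetical and do not correspond to a documented LongCat Avatar API. The mode names and resolution options come from this page. The rules that ATI2V needs a reference image and that continuation needs a source video are assumptions inferred from the mode names.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names throughout: this is not a documented LongCat Avatar API.
MODES = {"AT2V", "ATI2V", "continuation"}
RESOLUTIONS = {"480p", "720p", "1080p"}


@dataclass
class AvatarRequest:
    audio_path: str                        # speech, music, or podcast audio
    mode: str = "AT2V"
    resolution: str = "720p"
    duration_s: float = 30.0               # requested clip length in seconds
    multi_person: bool = False
    text_prompt: Optional[str] = None      # optional text reference
    reference_image: Optional[str] = None  # optional image reference
    source_video: Optional[str] = None     # clip to extend in continuation mode

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request is consistent."""
        problems = []
        if self.mode not in MODES:
            problems.append(f"unknown mode: {self.mode}")
        if self.resolution not in RESOLUTIONS:
            problems.append(f"unsupported resolution: {self.resolution}")
        if self.duration_s <= 0:
            problems.append("duration must be positive")
        # Assumption: image-conditioned generation needs an image to condition on.
        if self.mode == "ATI2V" and self.reference_image is None:
            problems.append("ATI2V requires a reference image")
        # Assumption: continuation needs an existing clip to continue from.
        if self.mode == "continuation" and self.source_video is None:
            problems.append("continuation requires a source video")
        return problems
```

Calling `AvatarRequest(audio_path="talk.wav", mode="ATI2V").validate()` reports the missing reference image. Returning a list rather than raising lets a front end show every problem at once.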
Technically, LongCat Avatar separates the roles of audio and motion using a disentangled guidance mechanism. Cross-Chunk Latent Stitching reduces visual drift by avoiding redundant decode-encode cycles over long timelines. Reference Skip Attention preserves character identity without rigid cloning. A coarse-to-fine strategy combined with Block Sparse Attention enables practical, production-ready inference at 720p/30fps, with support for higher output resolutions up to 1080p.
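To see why avoiding repeated decode-encode cycles matters over long timelines, consider the toy model below. It is not the Cross-Chunk Latent Stitching implementation, and nothing in it comes from the LongCat Avatar codebase. It assumes only that each decode-encode round trip is slightly lossy (modeled here as a 3-tap smoothing filter) and uses an identity "generator", so that all drift comes from the hand-off between chunks. All function names are hypothetical.

```python
def lossy_round_trip(latent):
    """Toy stand-in for decoding to pixels and re-encoding: a 3-tap smoothing
    filter that removes a little detail on every pass (edges are clamped)."""
    n = len(latent)
    out = []
    for i in range(n):
        left = latent[max(i - 1, 0)]
        right = latent[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * latent[i] + 0.25 * right)
    return out


def drift(a, b):
    """Mean absolute difference between two latent vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def carry_context(latent, num_chunks, round_trip):
    """Pass a context latent across chunk boundaries, optionally sending it
    through the lossy round trip at each boundary. The toy generator is the
    identity, so any change comes from the hand-off alone."""
    current = list(latent)
    for _ in range(num_chunks):
        if round_trip:
            current = lossy_round_trip(current)
    return current


reference = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
for chunks in (1, 4, 16):
    naive = drift(reference, carry_context(reference, chunks, round_trip=True))
    stitched = drift(reference, carry_context(reference, chunks, round_trip=False))
    print(f"{chunks:>2} chunks: round-trip drift={naive:.3f}, "
          f"latent hand-off drift={stitched:.3f}")
```

In this toy setting the round-trip drift grows with the number of chunks, while handing latents across directly leaves the context unchanged. A real model would also accumulate generation error, which this sketch deliberately leaves out.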
| Plan | Price | Credits | Approx. Videos | Resolution | Audio Duration (per gen) | Multi-Person | Priority | Notes |
|---|---|---|---|---|---|---|---|---|
| Base | $9.9 | 90 | Up to 18 | 480p/720p/1080p | Up to 60s | Not listed | Standard | Audio-driven avatar generation |
| Pro | $29.9 | 400 | Up to 80 | 480p/720p/1080p | Up to 60s | Yes | Priority | Designed for multi-person support |
| Ultimate | $49.9 | 800 | Up to 160 | 480p/720p/1080p | Up to 60s | Yes (interactions) | Priority | Listed as supporting long-form video generation |
| Creator | $99.9 | 1800 | Up to 360 | 480p/720p/1080p | Up to 60s | Yes | Highest | Listed as “multi-person & infinite-length support,” production-ready architecture, commercial license |
Notes: The model is optimized for long-duration content and theoretically supports infinite-length sequences. Pricing tiers list a 60-second audio duration per generation for credit-based usage, which may reflect service-level constraints rather than model capability.
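The table implies a fixed credit cost per video, which a few lines of arithmetic make explicit. The sketch below uses only the figures listed above. Because the video counts are listed as "up to", the per-video prices are best-case values. Splitting longer audio into 60-second generations is an assumption about how the limit might be worked around, not a documented feature.

```python
import math

# Figures copied from the pricing table: (price in USD, credits, approx. videos).
PLANS = {
    "Base": (9.9, 90, 18),
    "Pro": (29.9, 400, 80),
    "Ultimate": (49.9, 800, 160),
    "Creator": (99.9, 1800, 360),
}
MAX_AUDIO_S = 60  # listed per-generation audio limit, in seconds


def credits_per_video(plan):
    """Credits consumed per video if the full listed video count is reached."""
    _, credits, videos = PLANS[plan]
    return credits / videos


def price_per_video(plan):
    """Best-case price per video in USD."""
    price, _, videos = PLANS[plan]
    return price / videos


def generations_needed(audio_seconds):
    """Generations required if long audio is split at the listed limit
    (assumed usage pattern)."""
    return math.ceil(audio_seconds / MAX_AUDIO_S)


for name in PLANS:
    print(f"{name}: {credits_per_video(name):.0f} credits/video, "
          f"${price_per_video(name):.3f} per video")
print(generations_needed(10 * 60))  # a 10-minute podcast
```

Every tier works out to 5 credits per video, and the best-case price per video falls from about $0.55 on Base to about $0.28 on Creator. Under the splitting assumption, a 10-minute recording would take 10 generations.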