LongCat Avatar
Long-duration video generation with identity-consistent AI
About LongCat Avatar
Introduction to LongCat Avatar
LongCat Avatar is an audio-driven avatar model for long-duration video generation. It focuses on identity consistency, precise lip synchronization, and natural human dynamics, including gestures and idle movements during silent segments. The system is designed to maintain visual quality across extended or theoretically infinite-length sequences.
Built on the LongCat-Video architecture, the model supports multiple generation modes and efficient inference suitable for production workflows. It is relevant for creators, studios, education teams, research groups, and SaaS providers that require consistent, realistic avatar videos at scale.
Key Takeaways
- Supports audio-text-to-video (AT2V), audio-text-image-to-video (ATI2V), and audio-conditioned video continuation in a unified framework
- Cross-Chunk Latent Stitching minimizes quality degradation across long sequences
- Disentangled motion modeling produces natural gestures and idle behavior, even without speech
- Reference Skip Attention preserves identity without copy-paste artifacts
- Supports multi-person scenes and theoretically infinite-length sequences
- Efficient inference via coarse-to-fine generation and Block Sparse Attention; optimized for fast generation at 720p/30fps
- Output resolutions up to 1080p; designed for stable long-form content
- Open-source (MIT License) with local deployment support; model size noted at 13.6B parameters
How LongCat Avatar Works
The workflow begins with an audio input (speech, music, or podcast) and optional references (image or text). Users select a generation mode (AT2V, ATI2V, or audio-conditioned video continuation), then configure length, resolution, and whether multi-person generation is needed. The system is optimized for long-form stability and identity consistency across extended sequences.
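The configuration steps above can be pictured as a small request object that is checked before generation. This is only a sketch: `GenerationRequest`, its field names, and the validation rules are hypothetical and are not taken from the LongCat Avatar codebase. The rule that ATI2V needs an image and continuation needs a source video is an assumption inferred from the mode names; the resolution list comes from the pricing table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Mode(Enum):
    AT2V = "audio-text-to-video"
    ATI2V = "audio-text-image-to-video"
    CONTINUATION = "audio-conditioned video continuation"


@dataclass
class GenerationRequest:
    """Hypothetical request object; names are illustrative, not a real API."""
    audio_path: str
    mode: Mode
    text_prompt: Optional[str] = None
    reference_image: Optional[str] = None
    source_video: Optional[str] = None
    resolution: str = "720p"
    multi_person: bool = False

    def validate(self) -> List[str]:
        """Return a list of human-readable problems (empty if the request looks usable)."""
        problems = []
        if not self.audio_path:
            problems.append("audio input is required")
        # Assumption: each mode needs the reference type its name implies.
        if self.mode is Mode.ATI2V and not self.reference_image:
            problems.append("ATI2V needs a reference image")
        if self.mode is Mode.CONTINUATION and not self.source_video:
            problems.append("continuation needs a source video")
        if self.resolution not in {"480p", "720p", "1080p"}:
            problems.append(f"unsupported resolution: {self.resolution}")
        return problems


request = GenerationRequest(audio_path="talk.wav", mode=Mode.ATI2V)
print(request.validate())
```

Collecting all problems in a list, rather than raising on the first one, lets a front end show every missing input at once.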
Technically, LongCat Avatar separates the roles of audio and motion using a disentangled guidance mechanism. Cross-Chunk Latent Stitching reduces visual drift by avoiding redundant decode-encode cycles over long timelines. Reference Skip Attention preserves character identity without rigid cloning. A coarse-to-fine strategy combined with Block Sparse Attention enables practical, production-ready inference at 720p/30fps, with support for higher output resolutions up to 1080p.
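To see why skipping decode-encode cycles matters over long timelines, consider a toy model. This is not the LongCat Avatar implementation: the latent is just a list of floats, the "video model" simply carries its context forward, and the 1% loss per pixel round trip is an arbitrary assumption chosen to make the effect visible. All function names are hypothetical.

```python
from typing import List


def roundtrip_through_pixels(latent: List[float]) -> List[float]:
    """Stand-in for decoding to pixels and re-encoding; assumed to lose 1% of the signal."""
    return [0.99 * value for value in latent]


def generate_chunk(context: List[float]) -> List[float]:
    """Stand-in for the video model: here it only carries the context forward."""
    return list(context)


def final_context(num_chunks: int, stitch_in_latent_space: bool) -> List[float]:
    """Generate chunks one after another and return the context left after the last one."""
    context = [1.0, 1.0, 1.0]
    for _ in range(num_chunks):
        chunk = generate_chunk(context)
        if stitch_in_latent_space:
            # Hand the chunk's latents straight to the next chunk.
            context = chunk
        else:
            # Decode and re-encode between chunks; the loss compounds each time.
            context = roundtrip_through_pixels(chunk)
    return context


print(final_context(100, stitch_in_latent_space=True))
print(final_context(100, stitch_in_latent_space=False))
```

In this toy setup the latent-stitched context is unchanged after 100 chunks, while the round-trip variant has shrunk to roughly 0.99^100 of its starting value. A real codec's error behaves differently, but the compounding pattern is the point of the illustration.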
Core Benefits and Applications
- Long-form presenters and lectures: Maintain consistent identity and natural delivery across extended recordings.
- Podcasts and interviews: Generate hour-scale speaking videos with stable appearance and lip-sync.
- Entertainment and performance: Produce expressive acting or singing with rhythm-aware movement.
- Sales, marketing, and corporate communications: Create presenters that handle pauses and silent moments naturally.
- Multi-person conversations: Support multi-speaker interactions with individual identity preservation and turn-taking.
Pricing Overview (one-time credit packs)
| Plan | Price | Credits | Approx. Videos | Resolution | Audio Duration (per generation) | Multi-Person | Priority | Notes |
|---|---|---|---|---|---|---|---|---|
| Base | $9.90 | 90 | Up to 18 | 480p/720p/1080p | Up to 60s | Not listed | Standard | Audio-driven avatar generation |
| Pro | $29.90 | 400 | Up to 80 | 480p/720p/1080p | Up to 60s | Yes | Priority | Designed for multi-person support |
| Ultimate | $49.90 | 800 | Up to 160 | 480p/720p/1080p | Up to 60s | Yes (interactions) | Priority | Listed as supporting long-form video generation |
| Creator | $99.90 | 1800 | Up to 360 | 480p/720p/1080p | Up to 60s | Yes | Highest | Listed as “multi-person & infinite-length support,” production-ready architecture, commercial license |
Notes: The model is optimized for long-duration content and theoretically supports infinite-length sequences. Pricing tiers list a 60-second audio duration per generation for credit-based usage, which may reflect service-level constraints rather than model capability.
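For comparing tiers, the table figures can be reduced to a per-video cost. The short script below uses only the prices, credits, and "up to" video counts listed above; treating the "up to" count as the actual number of videos is an assumption, so the results are best-case figures.

```python
# (price in USD, credits, maximum listed videos), copied from the pricing table.
PLANS = {
    "Base": (9.9, 90, 18),
    "Pro": (29.9, 400, 80),
    "Ultimate": (49.9, 800, 160),
    "Creator": (99.9, 1800, 360),
}


def per_video(plan: str) -> tuple:
    """Return (credits per video, price per video) at the listed maximum."""
    price, credits, videos = PLANS[plan]
    return credits / videos, price / videos


for name in PLANS:
    credits_each, price_each = per_video(name)
    print(f"{name}: {credits_each:.0f} credits/video, ${price_each:.3f}/video")
```

Every tier works out to 5 credits per video at the listed maximum, so the tiers differ in price per credit rather than in credits per video: the best-case cost falls from about $0.55 per video on Base to about $0.28 on Creator.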