
InfiniteTalk ComfyUI Integration — Complete Setup Guide

A step-by-step walkthrough for integrating InfiniteTalk with ComfyUI so you can generate unlimited-length talking avatar videos with accurate lip sync and natural body movements.

InfiniteTalk Team · 30 min read · Setup · ComfyUI · AI Video

What is InfiniteTalk ComfyUI Integration?

InfiniteTalk is a talking avatar framework from the MultiTalk team that generates video of a speaking character driven by an audio track. One of its headline features is the ability to generate videos of effectively unlimited length.

This means you're no longer limited to short 10–15 second clips — you can generate minutes of content, or even longer, as long as your machine has enough RAM and VRAM. The model is still audio-driven, generating image-to-video output with natural lip syncing and enhanced body motions while the character speaks.

Why ComfyUI? The ComfyUI integration gives you a node-based visual workflow instead of command-line usage — ideal for experimenting, chaining post-processing nodes, and building repeatable pipelines.

Setting Up InfiniteTalk in ComfyUI

Follow these three steps to get InfiniteTalk running inside ComfyUI.

1. Update the Wan Video Wrapper

If you already use ComfyUI, update the Wan Video Wrapper custom node (ComfyUI-WanVideoWrapper) to its latest version, which now ships with InfiniteTalk support built in. New users can download it directly from GitHub.

2. Download InfiniteTalk Model Files

Go to the official Hugging Face repository for InfiniteTalk. Under the "Files and versions" tab you'll find a ComfyUI folder containing model files exported specifically for ComfyUI. Inside you'll see two files: InfiniteTalk Single (one person) and InfiniteTalk Multi (multiple people). Start with the single version for initial testing.

3. Install Model Files

Drop the downloaded .safetensors files into the diffusion_models subfolder inside your ComfyUI models/ directory. You can create a dedicated subfolder (e.g. InfiniteTalk/) for better organisation.
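The install step above can be sketched as a small script. This is only an illustration: the function name, the download folder, and the ComfyUI root path are assumptions, and the only layout taken from this guide is models/diffusion_models/ with an optional InfiniteTalk/ subfolder.

```python
from pathlib import Path
import shutil


def install_models(download_dir, comfyui_root, subfolder="InfiniteTalk"):
    """Move every .safetensors file from download_dir into
    <comfyui_root>/models/diffusion_models/<subfolder>/.

    Returns the destination paths. All arguments are hypothetical
    examples; adjust them to your own machine.
    """
    target = Path(comfyui_root) / "models" / "diffusion_models" / subfolder
    target.mkdir(parents=True, exist_ok=True)

    installed = []
    for src in sorted(Path(download_dir).glob("*.safetensors")):
        dest = target / src.name
        if not dest.exists():  # never overwrite a model that is already there
            shutil.move(str(src), str(dest))
        installed.append(dest)
    return installed


if __name__ == "__main__":
    # Example paths only; replace with your real folders.
    for path in install_models(Path.home() / "Downloads", Path.home() / "ComfyUI"):
        print("installed:", path)
```

Restart ComfyUI (or refresh the node list) afterwards so the model loader picks up the new files.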

Creating Your First Workflow

Using the Example Workflow

The easiest starting point is the example workflow bundled with the Wan Video Wrapper. After updating the custom node, you'll notice the MultiTalk nodes have been renamed: they now appear as MultiTalk and Infinite MultiTalk.

Model Selection

In the MultiTalk / InfiniteTalk model loader, select the InfiniteTalk model file you downloaded. For single-person use cases, choose the single variant. The surrounding node setup (block swap, Torch compile, VAE, CLIP text encoder) is identical to the previous MultiTalk workflow, so existing setups require minimal changes.

Optimisation Settings

By default the workflow uses the image-to-video LightX2V model to speed up sampling. Lowering the sampling step count reduces generation time at a small quality cost. 480p resolution is recommended for most machines — 720p requires significantly more VRAM and was unstable in early tests.

Advanced Features

Multiple People & Audio Tracks

InfiniteTalk inherits the multi-speaker capability from MultiTalk. You can pass in separate audio tracks and assign reference target masks to each person you want animated — ideal for dialogue scenes or podcast-style content.

Text-to-Speech Integration

Connect a TTS node (such as Chatterbox SRT Voice) upstream of the InfiniteTalk node. Type or load your script and the TTS node generates the audio automatically, removing the need to prepare audio files externally.

Long-Form Content Generation

The system calculates required video length from the audio duration automatically, making it straightforward to produce full podcast episodes or long explainer videos without manual trimming.
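The length calculation is simple arithmetic: audio duration multiplied by the output frame rate. The sketch below is not the node's actual code; the helper names are made up, and the 25 fps default is an assumption you should replace with whatever frame rate your workflow uses.

```python
import math
import wave


def wav_duration_seconds(path):
    """Return the duration of an uncompressed WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_for_audio(duration_s, fps=25):
    """Number of video frames needed to cover the audio, rounded up.

    fps=25 is an assumed default, not a value taken from this guide.
    """
    return math.ceil(duration_s * fps)


# A 90-second narration at the assumed 25 fps:
print(frames_for_audio(90.0))  # 2250
```

Rounding up means the video never ends before the audio does; the last frame may simply be held for a fraction of a second.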

Frame Interpolation

After generation, run a frame interpolation node to double the FPS. This meaningfully improves perceived smoothness and reduces minor artefacts like rapid eye blinking that can appear at the native frame rate.
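To make "doubling the FPS" concrete, here is a deliberately naive sketch that inserts one blended frame between each pair of neighbours. It only illustrates the frame count and timing; dedicated interpolation nodes estimate motion rather than averaging pixels, and the function name and flattened-list frame format are assumptions made for this example.

```python
def double_fps(frames, fps):
    """Insert a midpoint frame between every pair of neighbouring frames.

    `frames` is a list of equal-length lists of pixel values (each frame
    flattened). Returns (new_frames, new_fps). Plain averaging is a stand-in
    for a real motion-aware interpolation model.
    """
    if not frames:
        return [], fps * 2

    out = []
    for prev, nxt in zip(frames, frames[1:]):
        out.append(prev)
        out.append([(a + b) / 2 for a, b in zip(prev, nxt)])  # midpoint blend
    out.append(frames[-1])  # the last frame has no successor to blend with
    return out, fps * 2


# Three tiny two-pixel "frames" become five frames at twice the frame rate.
new_frames, new_fps = double_fps([[0, 0], [10, 20], [20, 40]], fps=25)
print(len(new_frames), new_fps)  # 5 50
```

With n input frames the result has 2n - 1 frames, so the clip duration stays the same while playback runs at double the rate.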

Performance & Quality

Chunk-Based Processing

During sampling you'll see the video processed in chunks — for example, 81 frames per chunk with 25 overlapping frames carried into the next segment. This overlap is what keeps the animation smooth and consistent across the full video duration.
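Using the example numbers above (81 frames per chunk, 25 overlapping frames), each new chunk contributes 81 - 25 = 56 fresh frames. The following sketch shows how such windows could be laid out over a video; it is an assumption about the scheduling for illustration, not the sampler's actual code, and the real implementation may pad or handle the final chunk differently.

```python
def chunk_windows(total_frames, chunk=81, overlap=25):
    """Return (start, end) frame ranges, end exclusive.

    Each window starts `overlap` frames before the previous one ended,
    so those frames are shared between neighbouring chunks.
    """
    if not 0 <= overlap < chunk:
        raise ValueError("overlap must be non-negative and smaller than chunk")
    if total_frames <= 0:
        return []

    stride = chunk - overlap  # fresh frames per window: 81 - 25 = 56
    windows = []
    start = 0
    while True:
        end = min(start + chunk, total_frames)
        windows.append((start, end))
        if end >= total_frames:
            break
        start += stride
    return windows


print(chunk_windows(200))
# [(0, 81), (56, 137), (112, 193), (168, 200)]
```

A larger overlap gives the model more shared context between chunks at the cost of more chunks, and therefore more sampling time, for the same video length.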

Hardware Requirements

For 480p generation, most modern GPUs with 6 GB+ VRAM are sufficient. 720p or very long videos require more VRAM and system RAM. Torch compile support is recommended for best throughput on CUDA devices.

InfiniteTalk vs MultiTalk

InfiniteTalk Advantages

  • Unlimited video length
  • More natural body language
  • Better lip sync accuracy
  • Fewer artefacts and distortions
  • Improved stability for long clips

MultiTalk Limitations

  • Limited to short clips
  • Occasional exaggerated or unnatural motion
  • Less natural body language
  • More artefacts in longer sequences
  • Inconsistent quality over time

Tips & Best Practices

Audio Quality

Use clean, high-quality audio without background noise. Better audio directly improves lip sync accuracy and facial expression fidelity.

Image Selection

Choose a well-lit, high-resolution photo with clear facial features. Input image quality has a direct impact on the output video quality.

Sampling Steps

Start with 4–8 steps for fast iteration. Increase steps for final renders when you're happy with the composition.

Post-Processing

Always apply frame interpolation after generation. Doubling FPS significantly smooths motion and reduces flickering artefacts.
