SAN FRANCISCO, Jan. 8, 2026 /PRNewswire/ -- CraftStory, a pioneer in realistic AI-generated human video, today announced the release of its Image-to-Video model, an expansion of Model 2.0 that enables users to generate up to five-minute, studio-quality human videos from a single image and a written script.

CraftStory launched its first Video-to-Video model, the first of its kind, in November 2025. This breakthrough model enabled users to generate up to five minutes of video by animating a still image using motion captured from a driving video.

Model 2.0 builds on CraftStory's existing suite of models and introduces a new capability that removes the need for source footage. Companies can now create expressive, long-form videos starting from nothing more than a photo and text, while preserving the same realism, continuity, and performance quality previously available only through Video-to-Video workflows.

Turning Images into Performances

As video becomes a primary communication channel for companies, teams face a familiar bottleneck: producing consistent, human-led content at scale is still slow, expensive, and difficult to update. While short AI clips exist, they often lack expressive motion, break down over time, or fail to sustain realism beyond a few seconds.

CraftStory's Image-to-Video model addresses this gap by transforming a single image into a complete, multi-minute performance, driven entirely by script or audio. The system generates natural facial expressions, body language, and gestures that evolve coherently over time — making it suitable for product explainers, training videos, customer communication, and educational content.

How Image-to-Video Works

With Image-to-Video, users upload:

a single image of a person, and

a script or audio track

CraftStory Model 2.0 then synthesizes a full video performance, animating both the person and the environment — delivering realistic lip-sync, expressive gestures, and scene motion aligned with speech rhythm and emotional tone.

The model shares the same core architecture as CraftStory's video-to-video system, including:

Advanced gesture generation algorithms that infer appropriate hand and body movements directly from audio

High-fidelity lip-sync, producing natural speech articulation over long sequences

Identity preservation, maintaining consistent appearance, emotion, and nuance throughout multi-minute videos

Model 2.0 also includes an advanced lip-sync system that turns any script or audio track into a realistic performance. A built-in gesture alignment algorithm ensures that body movements naturally match speech rhythm and emotion — bringing human expressiveness to AI-generated content.

Walk-and-Talk Videos with Moving Cameras

CraftStory is also introducing support for moving cameras. Model 2.0 can now generate walk-and-talk videos up to 80 seconds long, where the person moves naturally through the scene while speaking and the camera tracks the motion. This enables dynamic, cinematic shots that stand out from static, on-camera videos that often look the same. The feature is currently in beta and will be rolled out gradually to existing accounts.

Built for Long-Form Consistency

At the core of Model 2.0 is a proprietary parallelized diffusion pipeline, designed to scale human video generation beyond short clips. By processing different temporal segments simultaneously while enforcing global coherence, the system maintains visual consistency across minutes of footage — a key challenge in long-form video synthesis.
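CraftStory has not published the details of its pipeline, so the following is only a generic sketch of the idea described above: split a long timeline into overlapping temporal segments, process them concurrently, and blend the overlaps so that seams are smoothed. Every name here (`split_segments`, `fake_generate`, `stitch`, `generate_video`) and every parameter value is a hypothetical illustration, and the per-segment generator is a placeholder that returns one number per frame instead of running a diffusion model. Cross-fading overlaps is a simple stand-in for seam smoothing; it is not a claim about how Model 2.0 enforces global coherence.

```python
from concurrent.futures import ThreadPoolExecutor


def split_segments(total_frames, segment_len, overlap):
    """Return (start, end) frame windows covering the timeline with overlap."""
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must be in [0, segment_len)")
    step = segment_len - overlap
    windows = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        windows.append((start, end))
        if end == total_frames:
            break
        start += step
    return windows


def fake_generate(window):
    """Placeholder for a per-segment generator (assumption): one float per frame."""
    start, end = window
    return [float(start)] * (end - start)


def stitch(windows, outputs, total_frames):
    """Blend overlapping frames using a weighted average (linear cross-fade)."""
    total = [0.0] * total_frames
    weight = [0.0] * total_frames
    for (start, end), frames in zip(windows, outputs):
        n = end - start
        for i, value in enumerate(frames):
            # Triangular weight: low at the segment edges, high in the middle.
            w = min(i + 1, n - i)
            total[start + i] += w * value
            weight[start + i] += w
    return [t / w for t, w in zip(total, weight)]


def generate_video(total_frames, segment_len=8, overlap=4):
    """Process all segments concurrently, then stitch them into one timeline."""
    windows = split_segments(total_frames, segment_len, overlap)
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(fake_generate, windows))
    return windows, stitch(windows, outputs, total_frames)
```

Calling `generate_video(20)` yields four overlapping windows and a 20-frame blended timeline. In a real system the placeholder generator would be replaced by a model call, and additional conditioning shared across segments would likely be needed to keep identity and motion consistent over minutes of footage.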

The model was trained on high-frame-rate footage of real actors, capturing subtle facial dynamics as well as expressive hand and body motion. This allows Image-to-Video outputs to feel fluid and human, rather than static or robotic.

Videos can be generated in both portrait and landscape formats, at 480p and 720p, with optional upscaling to 1080p.

From Scripts to Human Communication

"Image-to-Video is a major step toward fully script-driven video creation," said Victor Erukhimov, Founder and CEO of CraftStory, who previously sold his computer vision startup to Intel. "You no longer need to record a video to get a realistic human performance. If you have an image and something to say, Model 2.0 can turn that into a believable, long-form video — complete with gestures and expressiveness that match the message."

What's Next

CraftStory is advancing Model 2.0 toward fully automated text-to-video workflows, with a focus on making marketing video creation faster, simpler, and more scalable for everyday use.

To try Image-to-Video with Model 2.0, visit https://craftstory.com/.

About CraftStory

CraftStory is a pioneer in realistic AI-generated human video, founded by the creators of OpenCV. The company enables businesses to create studio-quality, long-form videos at scale using AI. For more information, please visit https://craftstory.com/

