LTX-2 Video Generation Prompt Engineering: From 36-Scene Horror to Cinematic Continuity Pipelines

Summary

Producing watchable video with LTX-2 depends on how the prompts are structured. Generating a single pretty shot and chaining 36 scenes into a coherent story are fundamentally different engineering problems. This article consolidates three specifications refined through production use:

36-scene horror scenario generation template — system prompt for LLM-driven scenario writing
Cinematic prompt design principles — structure for maximizing individual shot quality
Multi-scene continuity control — pipeline design for preventing inter-clip visual breakage

Prerequisites

Video generation engine: LTX-2 (5-second clip generation)
Resolution: 3840x2160 (4K), 24fps
Scenario generation: Local LLM (Mixtral, etc.) generating STORY→SCENES→LTX2_PROMPTS in one pass
Pipeline: Per-scene generation → ffmpeg concatenation → audio composited downstream

1. 36-Scene Horror Scenario Generation Template

Design Intent

Background: When experimenting with Monstral-123B (NVFP4 quantization) for horror scenario generation, we encountered critical instabilities:

Scene counts varied between 20–50; achieving a fixed 36-scene structure was impossible
Narrative structure collapsed; act structure (setup/escalation/resolution) became unclear
Dialogue disappeared midway; mute scenes proliferated
Endings became ambiguous or devolved into abrupt “it was all a dream” tropes

These weren’t LLM performance failures—they stemmed from the absence of explicit output format constraints. Simply asking an LLM to “write a horror video scenario” grants too much freedom, destabilizing generation results. By designing a system prompt that rigorously constrains output schema and rules, we enabled mid-size models like Monstral-123B to reliably generate stable 36-scene horror narratives.

Output Schema (Fixed)

LLM output is restricted to exactly these 5 sections. No extra commentary.

STORY_PROMPT — narrative skeleton (one paragraph)
LABELS — genre, tone, motifs
OUTLINE — 5-stage plot summary
SCENES_36 — Scene 01 through Scene 36 (1-3 sentences each)
LTX2_PROMPTS_36 — Shot 01 through Shot 36 (generation prompt per shot)
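A fixed schema is only useful if the pipeline rejects output that deviates from it. The following is a minimal sketch of such a check. It assumes each section name appears alone on its own line (optionally followed by a colon); that layout and the function name `split_sections` are illustrative assumptions, not part of the original template.

```python
import re

# Section names come from the schema above. The "name alone on a line"
# layout is an assumption about how the LLM formats its reply.
SECTIONS = ["STORY_PROMPT", "LABELS", "OUTLINE", "SCENES_36", "LTX2_PROMPTS_36"]
_HEADER = re.compile(r"^(%s)[ \t]*:?[ \t]*$" % "|".join(SECTIONS), re.MULTILINE)


def split_sections(text: str) -> dict[str, str]:
    """Return the five sections in order, or raise ValueError on any deviation."""
    matches = list(_HEADER.finditer(text))
    found = [m.group(1) for m in matches]
    if found != SECTIONS:
        raise ValueError(f"expected sections {SECTIONS}, got {found}")
    if text[: matches[0].start()].strip():
        raise ValueError("extra commentary before the first section")
    sections = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections
```

A reply that fails this check can simply be regenerated rather than repaired by hand.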

Scene Allocation

01-08: Daily life and subtle unease (first whispers, minor anomalies)
09-20: Escalation and repetition (voices repeat, reflections speak back)
21-28: Investigation and confrontation (identifying the cause, revealing truth)
29-36: Resolution and closure (cause identified → concrete action → safety restored)

Dream endings, “it was all imagination” dismissals, and unresolved ambiguity are prohibited.
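The allocation above can be encoded as a lookup so that downstream tooling knows which act a scene belongs to. This is a small sketch; the names `ACTS` and `act_for_scene` are hypothetical helpers, and the ranges are taken directly from the list above.

```python
# Scene ranges copied from the allocation above (Python ranges exclude the end).
ACTS = [
    (range(1, 9), "daily life and subtle unease"),
    (range(9, 21), "escalation and repetition"),
    (range(21, 29), "investigation and confrontation"),
    (range(29, 37), "resolution and closure"),
]


def act_for_scene(scene_number: int) -> str:
    """Map a scene number (1-36) to the act it belongs to."""
    for scenes, label in ACTS:
        if scene_number in scenes:
            return label
    raise ValueError(f"scene number {scene_number} is outside 1-36")
```

For example, `act_for_scene(20)` falls in the escalation act and `act_for_scene(21)` in the investigation act.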

Required Elements Per Scene

  Scene 04:
  Visual: A woman stands before a mirror. Camera slowly zooms into her reflection.
  Whisper: "You see it too, don't you?"
  Sound: Fluorescent light flickers with a low humming sound.

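Each scene must supply the following (in the example above, the camera move is folded into the Visual line and the spoken line is a Whisper):

Visual: what is on screen
Camera: framing or camera movement
Dialogue: at least one spoken line (Dialogue / Whisper / Voice(V.O.) / Heard Voice / Inner Voice)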

No scene may be silent. For scenes where dialogue is unnatural, use whispered words, distorted voices, internal monologue, reflected/mirrored speech, or remembered phrases.
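The no-silent-scene rule is easy to check mechanically. The sketch below assumes each scene uses `Label: text` lines as in the Scene 04 example; the helper name `has_spoken_line` is hypothetical.

```python
# Accepted voice labels, taken from the Dialogue rule above.
VOICE_LABELS = ("Dialogue", "Whisper", "Voice(V.O.)", "Heard Voice", "Inner Voice")


def has_spoken_line(scene_text: str) -> bool:
    """True if the scene has at least one non-empty line with a voice label."""
    for line in scene_text.splitlines():
        label, sep, rest = line.strip().partition(":")
        if sep and label.strip() in VOICE_LABELS and rest.strip():
            return True
    return False
```

Scenes for which this returns False can be sent back to the LLM with a request to add a whispered or internal line.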

Setting Rules

Japan (urban or suburban), deliberately vague location names
Cultural elements: quiet residential streets, small houses, mirrors, shrines, corridors, sliding doors, evening TV noise, cicadas, wind, fluorescent lights
No graphic gore or explicit violence. Fear is built through psychological means: whispers, shadows, reflections, repetition, isolation

Story Flavors (Presets)

Pre-built flavor templates the LLM can auto-select when user input is vague:

trapped_room_mummy: locked in before summer break, only traces remain
water_reflection: water surface reflections gradually rewrite reality
station_beep_phrase: station chimes become words that guide people
fox_shrine_wrath: removal of a small shrine triggers quiet anomalies

LTX2_PROMPTS_36 Shot Definition

  Shot 01:
  DURATION=5s FPS=24 RES=3840x2160
  PROMPT: <concrete visual instruction, short>
  NEGATIVE: low quality, blurry, distorted hands, deformed face, gore, blood
  CAMERA: handheld POV / slow push-in / over-the-shoulder
  LIGHTING: low-key, streetlight, fluorescent hum
  AUDIO_CUE: <ambient, dialogue, silence>
  CONTINUITY: prev=none next=shared_object:mirror
  SEED_HINT: episode_seed+01

Avoid face close-ups; prefer hands, backs, and silhouettes. Telop (on-screen captions) and subtitles are composited downstream with ffmpeg.
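To feed shots to the generator programmatically, each shot block has to be turned into structured data. The following is a minimal parsing sketch based only on the layout shown above; the `Shot` class and `parse_shot` function are hypothetical names, and how the parsed values are passed to LTX-2 depends on your environment.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Shot:
    index: int
    params: dict[str, str] = field(default_factory=dict)      # DURATION, FPS, RES
    directives: dict[str, str] = field(default_factory=dict)  # PROMPT, CAMERA, ...


def parse_shot(block: str) -> Shot:
    """Parse one 'Shot NN:' block in the layout shown above."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    header = re.fullmatch(r"Shot (\d+):", lines[0]) if lines else None
    if header is None:
        raise ValueError("block must start with 'Shot NN:'")
    shot = Shot(index=int(header.group(1)))
    for ln in lines[1:]:
        labelled = re.match(r"^([A-Z_]+):\s*(.*)$", ln)
        if labelled:
            # e.g. "CAMERA: slow push-in"
            shot.directives[labelled.group(1)] = labelled.group(2)
            continue
        # Otherwise expect KEY=VALUE tokens, e.g. "DURATION=5s FPS=24".
        for token in ln.split():
            key, sep, value = token.partition("=")
            if not sep:
                raise ValueError(f"unrecognized line: {ln!r}")
            shot.params[key] = value
    return shot
```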

2. Cinematic Prompt Design Principles

Shot-First Thinking

The first thing in every prompt is “where the camera is.” Abstract phrasing like “the camera shows” is banned. Use explicit cinematography language:

static camera / slow pan / close framing / wide interior shot / shallow depth
Specify camera position, field of view, and spatial compression

Environment Anchoring (Visual Mood)

Embed lighting, color palette, and surface textures as shared parameters across prompts:

Lighting: warm, flickering, fluorescent, natural, overcast
Color: muted pastels, warm golds, sickly greens
Texture: fogged glass, worn metal, dust in light beams

This anchors the diffusion process and stabilizes results across scenes.
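One way to keep these parameters truly shared is to define them once per episode and append the same clause to every shot prompt. The sketch below illustrates that idea; `EnvironmentAnchor`, `anchor_prompts`, and the clause wording are assumptions for illustration, not a format LTX-2 requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentAnchor:
    """Lighting, color, and texture defined once and reused for every shot."""
    lighting: str
    color: str
    texture: str

    def as_clause(self) -> str:
        return (
            f"Lighting: {self.lighting}. "
            f"Color palette: {self.color}. "
            f"Texture: {self.texture}."
        )


def anchor_prompts(shot_prompts: list[str], anchor: EnvironmentAnchor) -> list[str]:
    """Append the identical environment clause to each shot prompt."""
    return [f"{p.rstrip()} {anchor.as_clause()}" for p in shot_prompts]
```

Because the anchor is frozen, a shot cannot accidentally drift to a different palette mid-episode.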

Action as Continuous Physical Sequence

Write actions as natural progression, not bullet points:

  Leaning into frame → Hesitating near the handle → Exhaling slowly to fog the glass

Use arrows (→) to chain the steps of a motion. When motion is abbreviated, the model cannot predict the intermediate frames and LTX-2 produces “teleportation”: the subject jumps from one pose to the next. Describing in detail how the body moves through space is an effective way to get smooth video.

Character Definition Through Behavior

Define characters through posture, micro-expressions, timing, and small physical habits rather than long descriptions. Include age, clothing, and emotional state expressed through motion.

Camera Movement as Narrative Tool

Slow pan: observation and tension building
Static shots: tension or comedic timing
Maintain consistent pan direction and speed across adjacent clips
When continuing from the previous scene, explicitly state “Continue the same pan”

Audio Is Mandatory

Audio is not decoration — it defines timing:

Ambient Sound: oven humming, steam, distant noise
Dialogue Beats: quoted dialogue, [Beat] and [Silence] markers for pauses
Music must be explicitly included or excluded

Prompt Structure Template

Shot Establishment
Environment & Lighting
Character Position & Emotion
Core Action Sequence
Camera Movement (with timing)
Audio & Dialogue
Ending Visual Beat (pose, pause, reaction)
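The seven slots above can be enforced with a small builder that refuses to render a prompt when a slot is empty or the action is abbreviated to a single step. This is a sketch under assumptions: the `ShotPrompt` class, its field names, and joining the slots with spaces are illustrative choices, not a documented LTX-2 input format.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    shot: str            # 1. Shot Establishment (camera position first)
    environment: str     # 2. Environment & Lighting
    character: str       # 3. Character Position & Emotion
    actions: list[str]   # 4. Core Action Sequence, one entry per step
    camera: str          # 5. Camera Movement (with timing)
    audio: str           # 6. Audio & Dialogue
    ending: str          # 7. Ending Visual Beat

    def render(self) -> str:
        """Join the slots in template order; actions are chained with arrows."""
        if len(self.actions) < 2:
            raise ValueError("describe at least two steps of motion")
        parts = [
            self.shot,
            self.environment,
            self.character,
            " → ".join(a.strip() for a in self.actions),
            self.camera,
            self.audio,
            self.ending,
        ]
        if any(not p.strip() for p in parts):
            raise ValueError("every template slot must be filled")
        return " ".join(p.strip() for p in parts)
```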

3. Multi-Scene Continuity Control

The Problem

Moving from single beautiful shots to multi-scene narratives introduces:

Visual drift: lighting and texture shift between shots
Camera discontinuity: pan speed and direction mismatch across clips
Timing loss: inability to control acting “beats” and pauses

Last-Frame Continuity

Continuation directive: explicitly state “Continue the same pan” at the start of the next scene
Seed incrementing: use a fixed episode seed plus shot index (episode_seed + shot_idx). This allows micro-variation between shots while maintaining macro-consistency across the episode
Texture inheritance: visual states from previous shots (e.g., fogged glass) become preconditions for the next shot
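The seed rule is simple arithmetic, and the SEED_HINT field from the shot definition can be resolved the same way. A minimal sketch, assuming hints always take the form `episode_seed+NN`; the function names are hypothetical.

```python
import re


def shot_seed(episode_seed: int, shot_idx: int) -> int:
    """Seed for one shot: the fixed episode seed plus the shot index."""
    return episode_seed + shot_idx


def resolve_seed_hint(hint: str, episode_seed: int) -> int:
    """Turn a SEED_HINT such as 'episode_seed+01' into a concrete integer seed."""
    m = re.fullmatch(r"episode_seed\+(\d+)", hint.strip())
    if m is None:
        raise ValueError(f"unrecognized SEED_HINT: {hint!r}")
    return shot_seed(episode_seed, int(m.group(1)))
```

With an episode seed of 1000, Shot 01 resolves to 1001 and Shot 36 to 1036, so regenerating a single shot reproduces the same seed without touching its neighbors.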

CONTINUITY Field in Practice

Each shot’s CONTINUITY field specifies prev= and next=, connecting adjacent scenes through shared objects, shared sounds, continuing actions, or location transitions:

  CONTINUITY: prev=shared_sound:fluorescent_hum next=location_transition:hallway_to_kitchen

Without explicit connection directives, LTX-2 treats each shot as an independent generation, producing jarring cuts when the clips are concatenated.
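Since the CONTINUITY field has a regular `prev=... next=...` shape, it can be parsed and checked before any clip is generated. The sketch below flags interior shots whose link is `none`; treating that as an error is an assumption drawn from the rule that adjacent shots must be connected, and the function names are hypothetical.

```python
def parse_continuity(value: str) -> dict:
    """Parse 'prev=kind:detail next=kind:detail'; a 'none' link becomes None."""
    links = {}
    for token in value.split():
        key, sep, link = token.partition("=")
        if not sep or key not in ("prev", "next"):
            raise ValueError(f"unexpected token: {token!r}")
        if link == "none":
            links[key] = None
        else:
            kind, _, detail = link.partition(":")
            links[key] = (kind, detail)
    if set(links) != {"prev", "next"}:
        raise ValueError("CONTINUITY needs both prev= and next=")
    return links


def missing_links(values: list[str]) -> list[tuple[int, str]]:
    """Return (shot number, side) for every undefined link between adjacent shots."""
    problems = []
    last = len(values)
    for number, value in enumerate(values, start=1):
        links = parse_continuity(value)
        if number > 1 and links["prev"] is None:
            problems.append((number, "prev"))
        if number < last and links["next"] is None:
            problems.append((number, "next"))
    return problems
```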

Avoiding Teleportation

When physical motion is abbreviated, the model cannot predict the intermediate frames and produces “teleportation.” The most effective countermeasure is to describe in detail how the body moves through space, step by step.

Takeaways

LTX-2 prompt engineering is fundamentally “writing a film storyboard in natural language.” The horror template forces structured output from the LLM. The cinematic principles maximize individual shot quality. The continuity controls connect shots into a narrative. All three layers are required for a production-ready pipeline. The future of video generation AI depends not on single-shot beauty, but on the intelligence embedded in the gaps between consecutive shots.

Reproduction Steps

Minimum Setup

LTX-2 video generation environment (ComfyUI or API)
Scenario generation LLM (local or API)
ffmpeg (clip concatenation and telop compositing)

Horror Scenario Generation Workflow

Select a Story Flavor (or provide free-form input)
Feed the system prompt template from this article to the LLM
Retrieve SCENES_36 and LTX2_PROMPTS_36 from output
Feed each shot’s PROMPT to LTX-2 sequentially
Concatenate generated clips with ffmpeg
Composite audio and telop downstream
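For the concatenation step, ffmpeg's concat demuxer reads a text file that lists the clips in order. The sketch below builds that list and the matching command line. The file names (`shot_01.mp4`, `clips.txt`, `episode.mp4`) are placeholders, and stream copy (`-c copy`) assumes every clip was generated with the same codec, resolution, and frame rate.

```python
def concat_list_text(clips: list[str]) -> str:
    """Build the list file body for ffmpeg's concat demuxer, one clip per line."""
    lines = []
    for clip in clips:
        # Inside single quotes, a literal quote is written as '\''
        escaped = clip.replace("'", "'\\''")
        lines.append(f"file '{escaped}'\n")
    return "".join(lines)


def concat_command(list_path: str, output: str) -> list[str]:
    """Argument list for joining the clips without re-encoding."""
    return [
        "ffmpeg", "-f", "concat", "-safe", "0",
        "-i", list_path, "-c", "copy", output,
    ]


# Example usage (placeholder file names; not executed here):
#   from pathlib import Path
#   import subprocess
#   clips = [f"shot_{i:02d}.mp4" for i in range(1, 37)]
#   Path("clips.txt").write_text(concat_list_text(clips), encoding="utf-8")
#   subprocess.run(concat_command("clips.txt", "episode.mp4"), check=True)
```

If the clips differ in encoding parameters, stream copy can produce a broken file, and re-encoding during concatenation is the safer choice.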

Cinematic Quality Checklist

Camera position specified at the start of each prompt
Lighting and color palette consistent across scenes
Actions described step-by-step with arrows (→)
CONTINUITY defined between adjacent shots
Audio directives included
Face close-ups avoided; hands/backs/silhouettes preferred
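Parts of this checklist can be automated as a rough lint pass before generation. The sketch below uses plain string heuristics: the camera term list is assembled from the examples in this article, and the face close-up check is a crude keyword match, so treat the results as hints rather than verdicts. The function name `lint_shot` is hypothetical.

```python
# Camera vocabulary collected from the examples in this article.
CAMERA_TERMS = (
    "static camera", "slow pan", "close framing", "wide interior shot",
    "shallow depth", "handheld pov", "slow push-in", "over-the-shoulder",
)


def lint_shot(prompt: str, continuity: str, audio_cue: str) -> list[str]:
    """Return a list of checklist violations for one shot (empty if none found)."""
    issues = []
    lowered = prompt.lower().lstrip()
    if not lowered.startswith(CAMERA_TERMS):
        issues.append("camera position not stated at the start of the prompt")
    if "→" not in prompt:
        issues.append("action not described step-by-step with arrows")
    if not continuity.strip():
        issues.append("CONTINUITY not defined")
    if not audio_cue.strip():
        issues.append("audio directive missing")
    if "face" in lowered and "close-up" in lowered:
        issues.append("possible face close-up")
    return issues
```

Lighting and palette consistency across scenes is not covered here, since it requires comparing shots against each other rather than inspecting one at a time.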
