How to Write Code from Video Using Audio-Visual Vibe Coding

  1. Record a screen capture of the target UI at 720p or higher, using slow, deliberate mouse movements and optional audio narration describing desired behavior.
  2. Install the DashScope SDK (pip install "dashscope>=1.14.0") and set your DASHSCOPE_API_KEY environment variable.
  3. Encode the video file as base64 (or upload to Alibaba Cloud OSS for files over 20 MB) and construct a multimodal message with a system prompt specifying the target framework and output format.
  4. Send the video and text prompt to the Qwen2.5-Omni model via MultiModalConversation.call(), allowing 30–90 seconds for processing.
  5. Extract fenced code blocks from the model's markdown response using regex, preferring blocks containing a complete HTML document structure.
  6. Review the generated HTML, CSS, and JavaScript for correctness, missing error handling, and accessibility before opening in a browser.
  7. Iterate by recording a follow-up video showing desired changes and appending it to the conversation history for multi-turn refinement within the 256K context window.

What Is Audio-Visual Vibe Coding?

The Evolution from Text Prompts to Video Input

The term "vibe coding" entered the developer lexicon in February 2025 when Andrej Karpathy described a workflow where programmers lean heavily on AI to generate code, guiding the process through high-level intent rather than line-by-line specification. In its original form, vibe coding meant typing loose natural language prompts and letting a large language model handle implementation details. That was the text era. Developers then started feeding screenshots and wireframes to multimodal models, receiving functional code in return. Audio-visual vibe coding pushes this further still: instead of describing what to build or showing a static image, developers record their screen, walk through a UI, narrate what they want, and hand the entire video to a model that watches, listens, reasons about temporal interactions, and generates working code.

This removes the specification step: the developer demonstrates instead of describing. The model decomposes layout, identifies components, infers interaction logic, and generates code all at once. Qwen2.5-Omni, released by Alibaba's Qwen team, is the model that makes this workflow practical. Its architecture was purpose-built for joint audio-visual understanding, and its scores on multimodal reasoning benchmarks like OmniBench (see the Qwen2.5-Omni technical report, Table 5 for specific results) back up that design choice.

What This Tutorial Builds

This tutorial walks through the complete workflow: recording a screen capture of a UI, sending it to Qwen2.5-Omni, and receiving functional HTML, CSS, and JavaScript output. It covers two paths, one using Alibaba's DashScope cloud API and another using HuggingFace Transformers for local inference.

Prerequisites: Python 3.10 or later (verify with python --version), a DashScope API key (a free tier is available; verify current availability and quotas at https://dashscope.console.aliyun.com/billing), and basic familiarity with making API calls in Python. For the local deployment path, a machine with 80GB or more of VRAM is necessary for the full-precision 7B model, though quantized variants can run on 24GB GPUs.

Qwen2.5-Omni Architecture at a Glance

Thinker-Talker Design with Hybrid-Attention MoE

Qwen2.5-Omni's architecture is split into two cooperating modules. The Thinker handles reasoning, code generation, and analytical tasks. It processes all input modalities, including video frames, audio waveforms, and text tokens, through a Hybrid-Attention Mixture of Experts (MoE) backbone. This MoE design routes different token types through specialized expert sub-networks rather than forcing all inputs through a single dense transformer. Vision-specialized experts process video frames. Audio experts handle audio channels. Language experts handle text tokens. A gating mechanism determines which experts activate for each input segment. (These routing descriptions are simplifications of the architecture described in the Qwen2.5-Omni technical report; consult the report for precise details on expert allocation.)

The Talker module handles speech synthesis. It takes the Thinker's reasoning output and produces natural-sounding spoken responses synchronized with the text output using a synchronization mechanism Alibaba calls ARIA. For code generation workflows, the Talker is less critical, but it enables scenarios where the model explains its code choices verbally while outputting them textually.

The model supports a 256K token context window (verify against the model card for the specific variant you are using). For video input, this translates to the ability to process several minutes of screen recording at reasonable frame rates without truncation, since video frames are tokenized and contribute to the context budget alongside any text prompt and audio track. The exact duration depends on frame rate and resolution; consult the model card for tokens-per-frame figures to calculate limits for your use case.
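
The budget arithmetic described above can be sketched as follows. This is a rough estimate only: the tokens-per-frame figure, the sampled frame rate, and the reserve for prompt and output are placeholder assumptions, not values from the model card, and `max_video_seconds` is a hypothetical helper name.

```python
def max_video_seconds(context_tokens, tokens_per_frame, fps,
                      reserved_tokens=8_000, audio_tokens_per_second=0):
    """Estimate how many seconds of video fit in the context budget.

    reserved_tokens is held back for the text prompt and the generated code.
    """
    if tokens_per_frame <= 0 or fps <= 0:
        raise ValueError("tokens_per_frame and fps must be positive")
    available = context_tokens - reserved_tokens
    if available <= 0:
        return 0.0
    tokens_per_second = tokens_per_frame * fps + audio_tokens_per_second
    return available / tokens_per_second


# Placeholder inputs: 256 tokens per frame and 2 sampled frames per second
# are illustrative guesses. Replace them with figures from the model card.
seconds = max_video_seconds(256_000, tokens_per_frame=256, fps=2)
print(f"Approximate video budget: {seconds:.0f} s")
```

Halving the sampled frame rate doubles the duration that fits, which is why low frame rates suit long walkthroughs.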

Key Capabilities That Matter for Developers

Speech recognition spans over 50 languages (verify the exact count on the model card), meaning narrated screen recordings in languages beyond English are viable input. Temporal reasoning across video frames lets the model detect UI interactions like clicks, scrolls, typing, and drag-and-drop sequences rather than treating each frame as an isolated image. The model identifies UI elements, infers spatial layout relationships, and recognizes action patterns, all capabilities required for generating code from a demonstrated interface.

How It Compares to GPT-4V and Gemini 1.5 Pro

Capability | Qwen2.5-Omni | GPT-4V (gpt-4-turbo) | Gemini 1.5 Pro
--- | --- | --- | ---
Video input support | Native, with temporal reasoning | No native video; requires manual frame extraction | Native
Audio input support | Native, multilingual | Not supported | Native
Max context length | 256K tokens | 128K tokens | 1M tokens
Speech generation | Yes (ARIA-synchronized) | No | Yes
Open weights available | Yes (HuggingFace) | No | No
Audio-video joint reasoning | Yes, end-to-end | No | Yes
Multimodal understanding (OmniBench) | See technical report, Table 5 for scores | Baseline | Comparable (see technical report for relative rankings)

Qwen2.5-Omni demonstrates competitive or superior results against Gemini 1.5 Pro in audio-visual joint reasoning tasks, according to the Qwen2.5-Omni technical report. Open-weights availability means developers can run the model locally, fine-tune it, and inspect its behavior in ways that closed models do not permit. Consult the Qwen2.5-Omni technical report for specific OmniBench scores and evaluation methodology.

Setting Up Your Environment

Option A: DashScope API (Recommended for This Tutorial)

The DashScope API is the fastest path to running Qwen2.5-Omni without local GPU resources. Install the SDK, configure an API key, and verify connectivity.

It is strongly recommended to use a virtual environment:

python -m venv qwen-vibe
source qwen-vibe/bin/activate  # On Windows: qwen-vibe\Scripts\activate

Then install and verify:

# Code Example 1: Install dependencies and verify DashScope API access
# Run first: pip install "dashscope>=1.14.0"

# Set your API key BEFORE launching Python:
#   export DASHSCOPE_API_KEY="sk-..."   # Linux/macOS
#   set DASHSCOPE_API_KEY=sk-...        # Windows CMD
# WARNING: Do not commit API keys to version control. Use environment variables or a .env file.

import os
import dashscope
from dashscope import MultiModalConversation

api_key = os.getenv("DASHSCOPE_API_KEY")
assert api_key, (
    "DASHSCOPE_API_KEY is not set. Export it in your shell before running."
)
dashscope.api_key = api_key


def extract_text(response, call_label="API call"):
    """Safely extract text from a DashScope MultiModalConversation response."""
    if response.status_code != 200:
        raise RuntimeError(
            f"{call_label} failed — status {response.status_code}: "
            f"{getattr(response, 'message', str(response))}"
        )
    try:
        choices = response.output.choices
        if not choices:
            raise ValueError("Response contained no choices.")
        content = choices[0].message.content
        if not content:
            raise ValueError("Response choice contained no content.")
        return content[0]["text"]
    except (AttributeError, IndexError, KeyError, TypeError) as exc:
        raise RuntimeError(
            f"{call_label} returned unexpected structure: {exc}"
        ) from exc


# Health check: minimal call to verify connectivity
response = MultiModalConversation.call(
    model="qwen2.5-omni",
    messages=[{"role": "user", "content": [{"text": "Hello, confirm you are online."}]}],
    timeout=120,
)

print("Status:", response.status_code)
print("Response:", extract_text(response, "health check"))

Option B: Local Deployment via HuggingFace Transformers

For developers with sufficient hardware, local deployment provides full control and avoids API rate limits. The full-precision 7B model requires approximately 80GB of VRAM (an A100 80GB or equivalent). Quantized versions using GPTQ or AWQ can fit on 24GB GPUs such as the RTX 4090, with some quality degradation. Verify the exact model repository slug and quantized variant IDs on the Qwen HuggingFace page before running.

Security warning: trust_remote_code=True executes arbitrary Python code downloaded from the model repository on HuggingFace Hub without sandboxing. Before running, review the model's repository files and pin to a specific commit hash using revision='<commit_sha>' to prevent silent updates. Check the model's HuggingFace page for the current transformers version requirement — if the model has been integrated into the core library, trust_remote_code may no longer be necessary. Verify with: python -c "from transformers import Qwen2_5OmniModel" — if this succeeds, the flag is not required.
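
One way to make the pinning advice hard to skip is to refuse to load anything unless the revision looks like a full commit hash. The sketch below assumes you want a 40-character hexadecimal SHA rather than a branch or tag name; `require_pinned_revision` is a hypothetical helper, not part of transformers.

```python
import re

# A full git commit hash is 40 lowercase hexadecimal characters.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def require_pinned_revision(revision):
    """Return the revision if it is a full commit SHA, otherwise raise.

    Branch names such as "main" and unfilled placeholders are rejected,
    because they can resolve to different code over time.
    """
    if not isinstance(revision, str) or not _FULL_SHA.fullmatch(revision):
        raise ValueError(
            f"revision {revision!r} is not a full 40-character commit SHA; "
            "copy one from the repository's 'Files and versions' tab"
        )
    return revision


# Example: an unfilled placeholder is caught before any download starts.
try:
    require_pinned_revision("<commit_sha_from_huggingface>")
except ValueError as exc:
    print("Refusing to load:", exc)
```

Passing the validated value as `revision=` keeps the check and the load call in one place.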

# Code Example 2: Load Qwen2.5-Omni locally via HuggingFace Transformers
# Run first:
#   pip install "transformers>=4.45.0" "torch>=2.1.0" accelerate
#   pip install "flash-attn>=2.5.0" --no-build-isolation
#
# Note: flash-attn requires CUDA 11.6+ and a compatible C++ compiler.
# If installation fails, remove attn_implementation="flash_attention_2" below;
# the model will use standard attention with higher memory usage.

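import torch
from transformers import Qwen2_5OmniProcessor

# The model class name has differed between transformers releases. Try the
# newer name first (an assumption; confirm on the model card), then fall back
# to the name this tutorial uses.
try:
    from transformers import Qwen2_5OmniForConditionalGeneration as Qwen2_5OmniModel
except ImportError:
    from transformers import Qwen2_5OmniModel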

# Verify exact model ID at https://huggingface.co/Qwen before running
model_name = "Qwen/Qwen2.5-Omni-7B"

# Pin to a specific commit hash to prevent silent code changes.
# Find the latest commit SHA on the model's HuggingFace page under "Files and versions".
PINNED_REVISION = "<commit_sha_from_huggingface>"  # e.g. "a1b2c3d"

processor = Qwen2_5OmniProcessor.from_pretrained(
    model_name,
    trust_remote_code=True,
    revision=PINNED_REVISION,
)

model = Qwen2_5OmniModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    revision=PINNED_REVISION,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # Recommended for long-context video
)

if torch.cuda.is_available():
    first_device = next(model.parameters()).device
    print(f"Model first parameter on: {first_device}")
    print(f"Memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
else:
    print("CUDA not available; running on CPU (very slow for video input)")

# Note: For 24GB GPUs, use a quantized variant.
# Verify the exact quantized model ID at https://huggingface.co/Qwen
# before running, e.g.:
# model_name = "Qwen/Qwen2.5-Omni-7B-GPTQ-Int4"  # verify this slug exists
# and set torch_dtype=torch.float16

The flash_attention_2 implementation is strongly recommended for video inputs because standard attention becomes prohibitively slow at the sequence lengths generated by video frame tokenization.

Recording Your Input Video

What Makes a Good Source Video

Resolution should be at least 720p to ensure UI text and small elements are legible after the model's vision encoder processes the frames. Frame rates between 5 and 30 fps work well. Higher frame rates consume more context tokens without proportional quality gains for typical UI demonstrations. Lower frame rates risk missing brief interactions like button clicks.

For the DashScope API free tier, keep recordings under 3 minutes to manage token costs and processing time. The 256K context window supports longer videos, but token costs and processing time scale accordingly. Including audio narration is optional but valuable: the model processes both the visual and audio channels, and spoken descriptions of intent ("now I click add to create a new task") give the model explicit signals about desired behavior.

Preparing the Video File

Qwen2.5-Omni accepts MP4, WebM, and MOV formats (verify accepted formats against the current DashScope multimodal API documentation). For API upload, keep file sizes reasonable; compressing to H.264 at a moderate bitrate (2-5 Mbps for 720p) strikes a good balance between visual clarity and upload speed. For local inference, larger files are fine since there is no upload bottleneck.
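
As a rough sketch of the compression step, the helpers below estimate output size from bitrate and duration and assemble an ffmpeg command for a 720p H.264 re-encode. The function names, the 15 fps default, and the 128 kbps audio bitrate are illustrative assumptions; the command is only built here, not executed, and it requires ffmpeg to be installed when you do run it.

```python
import shlex


def estimated_size_mb(duration_s, video_kbps, audio_kbps=128):
    """Approximate output size in MB (10^6 bytes) from bitrate and duration."""
    total_bits = (video_kbps + audio_kbps) * 1000 * duration_s
    return total_bits / 8 / 1e6


def ffmpeg_compress_args(src, dst, video_kbps=3000, height=720, fps=15):
    """Build an ffmpeg argument list for an H.264 re-encode at the given height."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale=-2:{height}",   # keep aspect ratio, force an even width
        "-r", str(fps),                # output frame rate
        "-c:v", "libx264", "-b:v", f"{video_kbps}k",
        "-c:a", "aac", "-b:a", "128k",
        dst,
    ]


size = estimated_size_mb(duration_s=45, video_kbps=3000)
print(f"Estimated size for a 45 s clip at 3 Mbps: {size:.1f} MB")
print(shlex.join(ffmpeg_compress_args("walkthrough.mov", "walkthrough.mp4")))
```

Pass the list to `subprocess.run(args, check=True)` once the estimate looks acceptable for your upload path.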

Screen recording tools like OBS Studio (cross-platform), the built-in macOS screen recorder (Cmd+Shift+5), or Windows Game Bar (Win+G) all produce suitable output without additional configuration.

Audio-Visual Vibe Coding: The Core Workflow

Step 1: Sending a Screen Recording to Qwen2.5-Omni

The DashScope API accepts multimodal messages where video content is passed alongside a text prompt. The following example sends a local MP4 file of a to-do app UI walkthrough with a prompt.

Note: The example below passes the video inline as a base64 data URI. Consult the DashScope multimodal API documentation for the currently recommended upload method (for example, OSS URLs or a dedicated upload endpoint). If the API does not accept inline base64 data URIs for video, upload the file to Alibaba Cloud OSS first and pass the resulting URL. Base64 encoding inflates the request by roughly a third, so videos larger than a few MB are better sent by URL.
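
The inline-versus-URL decision can be sketched as a small helper. This is an illustration under assumptions: the `{"video": ...}` part shape mirrors the examples in this tutorial, the 20 MB threshold is this tutorial's own cutoff rather than a documented API limit, and `video_content_part` is a hypothetical name.

```python
import base64
import math

MAX_INLINE_BYTES = 20 * 1024 * 1024  # tutorial cutoff, not a documented limit


def base64_length(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes bytes."""
    return 4 * math.ceil(n_bytes / 3)


def video_content_part(video_bytes=None, url=None, mime="video/mp4",
                       max_inline=MAX_INLINE_BYTES):
    """Return a message part that references the video by URL or inline data URI."""
    if url is not None:
        return {"video": url}
    if video_bytes is None:
        raise ValueError("provide either video_bytes or url")
    if len(video_bytes) > max_inline:
        raise ValueError(
            f"{len(video_bytes)} bytes exceeds the inline cutoff; upload and pass a URL"
        )
    encoded = base64.b64encode(video_bytes).decode("ascii")
    return {"video": f"data:{mime};base64,{encoded}"}


raw_size = 15 * 1024 * 1024
print(f"15 MiB of video becomes {base64_length(raw_size) / 1024 / 1024:.1f} MiB of base64")
```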

# Code Example 3: Send a screen recording to Qwen2.5-Omni via DashScope
import os
import base64
import dashscope
from dashscope import MultiModalConversation

api_key = os.getenv("DASHSCOPE_API_KEY")
assert api_key, "Set DASHSCOPE_API_KEY environment variable first."
dashscope.api_key = api_key

# Encode the local video file for upload
# WARNING: For videos larger than a few MB, consider uploading to OSS and passing
# the URL instead. Base64 encoding inflates file size by ~33% and may exceed
# API payload limits. See DashScope docs for the recommended upload method.
video_path = "todo_app_walkthrough.mp4"
assert os.path.exists(video_path), f"Video file not found: {video_path}"

_MAX_INLINE_BYTES = 20 * 1024 * 1024  # 20 MB

video_stat = os.stat(video_path)
if video_stat.st_size > _MAX_INLINE_BYTES:
    raise RuntimeError(
        f"Video is {video_stat.st_size / 1e6:.1f} MB — exceeds inline limit. "
        "Upload to Alibaba Cloud OSS and pass the resulting URL instead."
    )

with open(video_path, "rb") as f:
    video_bytes = f.read()
video_b64 = base64.b64encode(video_bytes).decode("utf-8")
print(f"Video encoded: {len(video_b64) / 1024:.1f} KiB base64")


def extract_text(response, call_label="API call"):
    """Safely extract text from a DashScope MultiModalConversation response."""
    if response.status_code != 200:
        raise RuntimeError(
            f"{call_label} failed — status {response.status_code}: "
            f"{getattr(response, 'message', str(response))}"
        )
    try:
        choices = response.output.choices
        if not choices:
            raise ValueError("Response contained no choices.")
        content = choices[0].message.content
        if not content:
            raise ValueError("Response choice contained no content.")
        return content[0]["text"]
    except (AttributeError, IndexError, KeyError, TypeError) as exc:
        raise RuntimeError(
            f"{call_label} returned unexpected structure: {exc}"
        ) from exc


# Construct the multimodal message
messages = [
    {
        "role": "system",
        "content": [{"text": "You are an expert frontend developer. Generate clean, production-quality code."}],
    },
    {
        "role": "user",
        "content": [
            {"video": f"data:video/mp4;base64,{video_b64}"},
            {
                "text": (
                    "Watch this screen recording and generate the complete code for what you see. "
                    "Output a single HTML file with embedded CSS and JavaScript. "
                    "Include all UI components, layout, styling, and interaction handlers."
                )
            },
        ],
    },
]

# Make the API call
response = MultiModalConversation.call(
    model="qwen2.5-omni",
    messages=messages,
    timeout=120,
)

# Extract the generated content
result = extract_text(response, "video-to-code call")
print(result)

The system prompt is not optional filler. Even in a video-driven workflow, specifying the target framework, code style, and output format in the text portion of the message matters. The difference between "generate code" and "generate production-quality React code in a single file" is the difference between generic markup and structured, idiomatic output.
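
To make the framework and output format explicit on every call, the message list can be produced by a small builder. The sketch below follows the message shape used in this tutorial; `build_messages` and its parameters are hypothetical, and the prompt wording is only one reasonable phrasing.

```python
def build_messages(video_ref,
                   framework="vanilla HTML, CSS, and JavaScript",
                   output_format="a single self-contained HTML file",
                   extra_requirements=()):
    """Build a system + user message pair that pins framework and output format."""
    system_text = (
        f"You are an expert frontend developer. Generate code using {framework}. "
        f"Deliver the result as {output_format}."
    )
    user_text = (
        "Watch this screen recording and generate the complete code for the UI it shows."
    )
    if extra_requirements:
        user_text += " Requirements: " + "; ".join(extra_requirements) + "."
    return [
        {"role": "system", "content": [{"text": system_text}]},
        {"role": "user", "content": [{"video": video_ref}, {"text": user_text}]},
    ]


messages_react = build_messages(
    "https://example.com/walkthrough.mp4",  # placeholder URL
    framework="React with functional components and hooks",
    output_format="one App.jsx file",
    extra_requirements=["persist tasks in localStorage"],
)
print(messages_react[0]["content"][0]["text"])
```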

Step 2: Understanding the Model's Response

The response object contains the model's text output, which typically includes a markdown-formatted explanation followed by code blocks. When processing a UI walkthrough, the model decomposes the video into several layers of understanding: it identifies individual UI components (buttons, input fields, lists, navigation bars), infers their spatial layout and hierarchy, detects demonstrated interactions (clicks, scrolls, text input), and reasons about the temporal sequence to determine cause-and-effect relationships between actions.

When audio narration is present, the model integrates spoken descriptions with visual observations. A narration like "now I click the add button and a new task appears in the list" tells the model to wire up an event handler: the add button needs a click listener that appends an item to the task list. This joint audio-visual reasoning is where the Thinker-Talker architecture's design pays off, since both channels inform the same reasoning process rather than being handled independently.

Step 3: Extracting and Running the Generated Code

The model's response is typically markdown containing fenced code blocks. These need to be extracted and written to files:

# Code Example 4: Parse response, extract code blocks, and write to files
import re
import os


def extract_and_save_code(model_response, output_filename="index.html"):
    """Extract the best HTML/CSS/JS code block from the model's markdown response.

    Matches any fenced code block regardless of language label. Prefers blocks
    that look like complete HTML documents; falls back to the largest block.
    """
    # Match any fenced code block regardless of language label
    code_blocks = re.findall(
        r"```[^\n]*\n(.*?)```",
        model_response,
        re.DOTALL,
    )

    if not code_blocks:
        print("No fenced code blocks found in response. Raw response snippet:")
        print(model_response[:500])
        return None

    # Prefer the first block that looks like a complete HTML document
    html_blocks = [b for b in code_blocks if re.search(r"<!DOCTYPE|<html", b, re.IGNORECASE)]
    code = (html_blocks[0] if html_blocks else max(code_blocks, key=len)).strip()

    # Write to file
    with open(output_filename, "w", encoding="utf-8") as f:
        f.write(code)

    print(f"Code written to {output_filename} ({len(code)} characters)")
    return output_filename


# Use the 'result' variable from Code Example 3
output_file = extract_and_save_code(result)

if output_file:
    abs_path = os.path.abspath(output_file)
    print(f"Generated app written to: {abs_path}")
    print("Review the file before opening in a browser — it contains LLM-generated JavaScript.")
    # To open after review: webbrowser.open(f"file://{abs_path}")
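
To see how the block selection behaves without calling the API, the same logic can be run against a hand-written sample response. This is a standalone sketch: `pick_code_block` is a hypothetical name, the sample text is invented, and the fence marker is built from a variable only so the snippet can itself be shown inside a fenced block.

```python
import re

FENCE = "`" * 3  # three backticks, built indirectly to avoid nesting fences
CODE_BLOCK = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def pick_code_block(markdown_text):
    """Return the first HTML-document block, else the largest block, else None."""
    blocks = CODE_BLOCK.findall(markdown_text)
    if not blocks:
        return None
    html_blocks = [b for b in blocks if re.search(r"<!DOCTYPE|<html", b, re.IGNORECASE)]
    return (html_blocks[0] if html_blocks else max(blocks, key=len)).strip()


sample = "\n".join([
    "Here are the styles:",
    FENCE + "css",
    "body { margin: 0; }",
    FENCE,
    "And the page:",
    FENCE + "html",
    "<!DOCTYPE html>",
    "<html><body><h1>Todo</h1></body></html>",
    FENCE,
])
print(pick_code_block(sample))
```

The CSS block comes first in the sample, but the HTML document is preferred because it matches the doctype check.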

Step 4: Iterating with Follow-Up Video Clips

One of the most powerful aspects of this workflow is multi-turn iteration. Each API call is stateless, so the client resends the full conversation history, including the first video, with every follow-up. The 256K context window is what leaves room for that history plus a second recording showing the desired changes:

# Code Example 5: Multi-turn conversation with a second video for iteration
# Continuing from Code Examples 3 and 4: 'messages', 'response', 'extract_text',
# and 'extract_and_save_code' must be in scope.
# If running in a fresh session, re-run Code Examples 3 and 4 first.

import os
import copy
import base64
from dashscope import MultiModalConversation

# Build the continuation without mutating the original messages list
turn_messages = copy.deepcopy(messages)

# Add the assistant's first response to conversation history
turn_messages.append({
    "role": "assistant",
    "content": response.output.choices[0].message.content,
})

# Encode a second video showing desired changes (e.g., drag-and-drop reordering)
# This file must exist in your working directory.
second_video_path = "todo_drag_and_drop.mp4"
assert os.path.exists(second_video_path), f"Video file not found: {second_video_path}"

_MAX_INLINE_BYTES = 20 * 1024 * 1024  # 20 MB

video2_stat = os.stat(second_video_path)
if video2_stat.st_size > _MAX_INLINE_BYTES:
    raise RuntimeError(
        f"Video is {video2_stat.st_size / 1e6:.1f} MB — exceeds inline limit. "
        "Upload to Alibaba Cloud OSS and pass the resulting URL instead."
    )

with open(second_video_path, "rb") as f:
    video2_b64 = base64.b64encode(f.read()).decode("utf-8")

# Add the follow-up request with the new video
turn_messages.append({
    "role": "user",
    "content": [
        {"video": f"data:video/mp4;base64,{video2_b64}"},
        {"text": "Now add the drag-and-drop reordering feature shown in this recording to the previous code."},
    ],
})

# Make the follow-up API call
response2 = MultiModalConversation.call(
    model="qwen2.5-omni",
    messages=turn_messages,
    timeout=120,
)
updated_code = extract_text(response2, "follow-up call")
extract_and_save_code(updated_code, "index_v2.html")

The model processes the second video in the context of the first, understanding both the existing code it generated and the new interactions being demonstrated. This makes incremental refinement feel conversational rather than requiring a fresh start each time.
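
For more than one round of refinement, the history handling can be factored into a helper that never mutates the list it receives. The sketch below assumes the message shape used throughout this tutorial; `append_turn` is a hypothetical name and the content values are stand-ins for real API output.

```python
import copy


def append_turn(history, assistant_content, video_ref, instruction):
    """Return a new history with the assistant reply and the next user turn added."""
    updated = copy.deepcopy(history)
    updated.append({"role": "assistant", "content": assistant_content})
    updated.append({
        "role": "user",
        "content": [{"video": video_ref}, {"text": instruction}],
    })
    return updated


# Stand-in data: in practice the history and the assistant content come from
# the first API call.
first_turn = [
    {"role": "system", "content": [{"text": "You are an expert frontend developer."}]},
    {"role": "user", "content": [{"video": "clip1.mp4"}, {"text": "Generate the code."}]},
]
second_turn = append_turn(
    first_turn,
    assistant_content=[{"text": "(first generated answer)"}],
    video_ref="clip2.mp4",
    instruction="Add the drag-and-drop reordering shown in this recording.",
)
print([m["role"] for m in second_turn])
```

Because each call returns a fresh list, you can branch from any earlier turn and try a different follow-up without rebuilding the conversation.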

Full Working Example: Screen Recording to Functional To-Do App

The scenario: a 45-second screen recording of a hand-drawn wireframe walkthrough showing a to-do application. The recording includes voice narration describing features: "This is the task input field at the top, here's the add button, tasks appear in a list below, each task has a checkbox to mark it complete and a delete button."

Before running: Ensure todo_app_walkthrough.mp4 exists in your working directory. Record it using OBS Studio, macOS screen recorder, or similar as described in the "Recording Your Input Video" section above.

# Code Example 6: Complete end-to-end pipeline — video to working app
import os
import re
import base64
import dashscope
from dashscope import MultiModalConversation

# --- Configuration ---
api_key = os.getenv("DASHSCOPE_API_KEY")
assert api_key, "Set DASHSCOPE_API_KEY environment variable first."
dashscope.api_key = api_key

VIDEO_PATH = "todo_app_walkthrough.mp4"  # Your screen recording
OUTPUT_FILE = "todo_app.html"
MODEL = "qwen2.5-omni"

_MAX_INLINE_BYTES = 20 * 1024 * 1024  # 20 MB


def extract_text(response, call_label="API call"):
    """Safely extract text from a DashScope MultiModalConversation response."""
    if response.status_code != 200:
        raise RuntimeError(
            f"{call_label} failed — status {response.status_code}: "
            f"{getattr(response, 'message', str(response))}"
        )
    try:
        choices = response.output.choices
        if not choices:
            raise ValueError("Response contained no choices.")
        content = choices[0].message.content
        if not content:
            raise ValueError("Response choice contained no content.")
        return content[0]["text"]
    except (AttributeError, IndexError, KeyError, TypeError) as exc:
        raise RuntimeError(
            f"{call_label} returned unexpected structure: {exc}"
        ) from exc


def extract_and_save_code(model_response, output_filename="index.html"):
    """Extract the best HTML/CSS/JS code block from a markdown response."""
    code_blocks = re.findall(
        r"```[^\n]*\n(.*?)```",
        model_response,
        re.DOTALL,
    )

    if not code_blocks:
        print("No fenced code blocks found. Raw response snippet:")
        print(model_response[:500])
        return None

    # Prefer the first block that looks like a complete HTML document
    html_blocks = [b for b in code_blocks if re.search(r"<!DOCTYPE|<html", b, re.IGNORECASE)]
    code = (html_blocks[0] if html_blocks else max(code_blocks, key=len)).strip()

    with open(output_filename, "w", encoding="utf-8") as f:
        f.write(code)

    print(f"Code written to {output_filename} ({len(code)} characters)")
    return output_filename


# --- Step 1: Encode the video ---
print(f"Reading video: {VIDEO_PATH}")
assert os.path.exists(VIDEO_PATH), f"Video file not found: {VIDEO_PATH}"

video_stat = os.stat(VIDEO_PATH)
if video_stat.st_size > _MAX_INLINE_BYTES:
    raise RuntimeError(
        f"Video is {video_stat.st_size / 1e6:.1f} MB — exceeds inline limit. "
        "Upload to Alibaba Cloud OSS and pass the resulting URL instead."
    )

with open(VIDEO_PATH, "rb") as f:
    video_bytes = f.read()

video_b64 = base64.b64encode(video_bytes).decode("utf-8")
print(f"Video encoded: {len(video_b64) / 1024:.1f} KiB base64")

# --- Step 2: Build the multimodal message ---
messages = [
    {
        "role": "system",
        "content": [{"text": (
            "You are an expert frontend developer. Generate a single, self-contained HTML file "
            "with embedded CSS and JavaScript. Use modern ES6+, semantic HTML5, and clean CSS. "
            "The app must be fully functional with no external dependencies."
        )}],
    },
    {
        "role": "user",
        "content": [
            {"video": f"data:video/mp4;base64,{video_b64}"},
            {"text": (
                "Watch this screen recording of a to-do app walkthrough. Generate the complete, "
                "working code for the application shown. Include all UI components, styling, layout, "
                "and interaction handlers demonstrated in the video. Listen to the audio narration "
                "for additional feature requirements."
            )},
        ],
    },
]

# --- Step 3: Call the API ---
print("Sending to Qwen2.5-Omni (this may take 30-90 seconds)...")
response = MultiModalConversation.call(model=MODEL, messages=messages, timeout=120)

result_text = extract_text(response, "video-to-code call")
print(f"Response received ({len(result_text)} characters)")

# --- Step 4: Extract code and write to file ---
output_file = extract_and_save_code(result_text, OUTPUT_FILE)

if not output_file:
    raise RuntimeError("Could not extract code from model response.")

# --- Step 5: Notify user ---
abs_path = os.path.abspath(output_file)
print(f"Generated app written to: {abs_path}")
print("Review the file before opening in a browser — it contains LLM-generated JavaScript.")
# To open after review: webbrowser.open(f"file://{abs_path}")

What the Model Got Right

In informal testing with five narrated screen recordings, Qwen2.5-Omni reliably identified standard UI components: input fields, buttons, list containers, checkboxes, and delete controls. Layout fidelity to the demonstrated interface held up for common patterns like top-bar-plus-list or sidebar-plus-content arrangements. Event handler logic inferred from demonstrated interactions, particularly add, delete, and toggle-complete actions, was functionally correct on the first pass in four of five recordings. Results will vary depending on recording quality, narration clarity, and UI complexity.

Where It Struggled (and How to Fix It)

Complex CSS animations are a common failure mode. In three of five test recordings that included CSS transitions or hover effects, the generated code either simplified the animation to a basic property change or omitted it entirely. Ambiguous gestures cause problems as well: a fast mouse movement between two elements might be interpreted as a drag operation when it was simply navigation.

Overlapping UI elements, particularly modals or dropdown menus that obscure underlying content, can confuse the model's spatial reasoning about component hierarchy.

Workarounds that improved results in each of our test recordings: add brief audio narration to clarify intent at ambiguous moments. Break complex UIs into shorter, focused recordings of individual features rather than one long walkthrough. Then use the multi-turn iteration workflow from Step 4 to refine specific aspects after the initial generation pass.
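
Splitting by feature is a manual judgment, but a fixed-length split is a simple starting point. The sketch below plans overlapping time windows and builds an ffmpeg cut command for each; the 60-second window, 2-second overlap, and function names are illustrative assumptions. With stream copy, cuts land on keyframes, so boundaries may shift slightly.

```python
def plan_segments(total_seconds, max_segment_seconds=60, overlap_seconds=2):
    """Split a recording into (start, end) windows with a small overlap."""
    if max_segment_seconds <= overlap_seconds:
        raise ValueError("max_segment_seconds must exceed overlap_seconds")
    segments = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_segment_seconds, total_seconds)
        segments.append((start, end))
        if end >= total_seconds:
            break
        start = end - overlap_seconds  # step back so no interaction is cut in half
    return segments


def ffmpeg_cut_args(src, dst, start, end):
    """Build an ffmpeg argument list that copies one window without re-encoding."""
    return ["ffmpeg", "-ss", str(start), "-i", src,
            "-t", str(end - start), "-c", "copy", dst]


for index, (start, end) in enumerate(plan_segments(150)):
    print(ffmpeg_cut_args("walkthrough.mp4", f"part_{index}.mp4", start, end))
```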

Tips for Better Results

Optimizing Your Video Input

Slow, deliberate mouse movements outperform fast navigation. When the cursor moves quickly, the model has fewer frames to establish the relationship between the pointer position and the target element. Zoom into key UI areas for detail-heavy components like form fields with specific placeholder text or icons with particular styling; this gives the vision encoder more pixels to work with per relevant element. Audio narration helps specify business logic the model cannot see: validation rules, API endpoint patterns, data persistence requirements, and edge case behaviors.

Prompt Engineering Still Matters

Even in a video-driven workflow, the text portion of the multimodal message shapes output quality. A bare video with no text prompt produces generic code. Adding a one-line system prompt that specifies the target framework (e.g., "Generate production-ready React code using functional components and hooks") produces structured, idiomatic output instead of generic markup. Specifying output format ("a single self-contained HTML file" vs. "separate files for HTML, CSS, and JS") prevents the model from guessing at project structure.

Limitations and Honest Assessment

This workflow produces prototype-grade code. The output is functional but not production-ready without review. Missing error handling, accessibility attributes, and edge case coverage are the norm, not the exception.
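
Part of that review can be automated. The sketch below uses the standard library's `html.parser` to flag a few common accessibility gaps in generated markup. It is a deliberately small, assumption-laden check (for example, it does not recognize inputs nested inside a `<label>`), and `audit_html` is a hypothetical helper, not a substitute for a real audit tool.

```python
from html.parser import HTMLParser


class A11yAudit(HTMLParser):
    """Collect a few easy-to-detect accessibility problems."""

    def __init__(self):
        super().__init__()
        self.issues = []
        self.label_targets = set()
        self.inputs = []  # (id, has_aria_label)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "html" and not attributes.get("lang"):
            self.issues.append("<html> is missing a lang attribute")
        elif tag == "img" and "alt" not in attributes:
            self.issues.append("<img> is missing alt text")
        elif tag == "label" and attributes.get("for"):
            self.label_targets.add(attributes["for"])
        elif tag == "input" and attributes.get("type") not in ("hidden", "submit", "button"):
            has_aria = bool(attributes.get("aria-label") or attributes.get("aria-labelledby"))
            self.inputs.append((attributes.get("id"), has_aria))


def audit_html(source):
    """Return a list of human-readable accessibility findings."""
    parser = A11yAudit()
    parser.feed(source)
    parser.close()
    issues = list(parser.issues)
    for input_id, has_aria in parser.inputs:
        if not has_aria and input_id not in parser.label_targets:
            issues.append(f"<input id={input_id!r}> has no label or aria-label")
    return issues


generated = '<html><body><input id="task"><img src="logo.png"></body></html>'
for finding in audit_html(generated):
    print("-", finding)
```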

The 80GB VRAM requirement for local deployment of the full-precision model puts self-hosting out of reach for most individual developers. The DashScope API is the practical path for the majority of users. Video processing latency is non-trivial: in five informal tests via the API on default-tier accounts, response times ranged from 30 to 90 seconds for a 1-minute recording, varying with video complexity and API load.
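
Given that latency, it helps to time each call and retry transient failures with a growing delay. The wrapper below is a generic sketch that works with any zero-argument callable; `call_with_retries` is a hypothetical helper, and catching every exception is a simplification you would narrow in real code.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff; return (result, seconds)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        started = time.monotonic()
        try:
            result = fn()
            return result, time.monotonic() - started
        except Exception as exc:  # simplification: narrow this in real code
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


# Example with a stand-in function; wrap the real API call in a lambda instead.
result, elapsed = call_with_retries(lambda: "generated code", attempts=2)
print(f"Got {result!r} in {elapsed:.3f} s")
```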

The model can and does hallucinate UI elements not present in the video, particularly when recordings are ambiguous or low-resolution. DashScope API costs for video input tokens are higher than text-only calls because video frames generate substantially more tokens per second of content than equivalent text descriptions would. Consult the DashScope pricing page for current rates.

Is Audio-Visual Vibe Coding the Future?

This tutorial demonstrated a complete pipeline: screen recording to API call to functional HTML, CSS, and JavaScript output, with multi-turn iteration for refinement. Qwen2.5-Omni's Thinker-Talker architecture and joint audio-visual reasoning represent a genuine capability jump over text-only or image-only code generation workflows. The approach is practical today for rapid prototyping, UI-to-code translation, and scenarios where describing an interface in text is harder than simply showing it.

The open question is whether this stays a prototyping trick or becomes a standard part of the development workflow. That depends on two things: whether model accuracy on complex UIs improves enough to reduce the manual cleanup cycle, and whether video-input token costs drop enough to make iterative use economically viable. Both are moving targets.

Matt Mickiewicz

Matt is the co-founder of SitePoint, 99designs and Flippa. He lives in Vancouver, Canada.

© 2000 – 2026 SitePoint Pty. Ltd.