arxiv:2605.08158

HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

Published on May 4

Authors:

Abstract

HY-Himmel is a hierarchical video-language framework that improves long-video understanding by separating semantic and motion processing through sparse anchor I-frames and dense compressed-domain encoding, achieving better performance with fewer context tokens.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2 to 63.5%) while using 3.6x fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2605.08158

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.08158 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.08158 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.08158 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.