SAGE-MM-Qwen2.5-VL-7B-SFT_RL / README.md

praeclarumjj3

Update README.md

63930ee verified about 15 hours ago

preview code

raw

history blame contribute delete

1.99 kB

metadata

license: apache-2.0
datasets:
  - allenai/SAGE-MM-RL-7k
  - allenai/SAGE-MM-SFT-417K
language:
  - en
base_model:
  - allenai/SAGE-MM-Qwen2.5-VL-7B-SFT
pipeline_tag: video-text-to-text

GitHub Repo: https://github.com/allenai/SAGE
Project Page: https://praeclarumjj3.github.io/sage/

System Capabilities

SAGE-MM operates as the core decision-maker within the SAGE system. It functions in two distinct stages:

Stage-1 (Context VLM): The model analyzes initial sampled frames and metadata to determine if the query can be answered immediately ("single-turn") or if it requires tool usage ("multi-turn").
Stage-2 (Iterative Reasoner): If tools are needed, the model enters a loop where it calls tools, analyzes their output, and updates its context until a final answer is derived.

Supported Tools

The model is trained to generate JSON-formatted actions to invoke the following tools:

web-search: Search the internet for external knowledge (e.g., sports standings, cast lists).
transcribe-speech: Perform ASR on specific timestamped segments of the video.
ground-event: Locate start/end timestamps for specific visual events.
extract-video-parts: Extract high-resolution frames or subclips from specific timestamps.
analyze: Perform detailed visual analysis on extracted media.

Usage

Note: SAGE-MM outputs JSON action strings. It requires a runtime environment (provided in our GitHub repo) to parse these strings, execute the tools, and feed the observation back to the model.

License

This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.