metadata
license: apache-2.0
datasets:
- allenai/SAGE-MM-RL-7k
- allenai/SAGE-MM-SFT-417K
language:
- en
base_model:
- allenai/SAGE-MM-Qwen2.5-VL-7B-SFT
pipeline_tag: video-text-to-text
- GitHub Repo: https://github.com/allenai/SAGE
- Project Page: https://praeclarumjj3.github.io/sage/
System Capabilities
SAGE-MM operates as the core decision-maker within the SAGE system. It functions in two distinct stages:
- Stage-1 (Context VLM): The model analyzes the initially sampled frames and metadata to determine whether the query can be answered immediately ("single-turn") or requires tool usage ("multi-turn").
- Stage-2 (Iterative Reasoner): If tools are needed, the model enters a loop where it calls tools, analyzes their output, and updates its context until a final answer is derived.
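The two stages can be pictured as a short control loop. The sketch below is illustrative only and is not the SAGE runtime: the `run_sage` function, the `model(stage, context)` call signature, the `mode`/`answer`/`tool`/`arguments` keys, and the stub model and tool are all assumptions made for this example. The actual prompt and action formats are defined in the GitHub repo.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces, assumed for illustration only.
Model = Callable[[str, List[str]], dict]   # (stage, context) -> decision or action
Tool = Callable[[dict], str]               # arguments -> observation text


def run_sage(model: Model, tools: Dict[str, Tool],
             context: List[str], max_steps: int = 8) -> str:
    """Sketch of the two-stage flow: route the query, then loop over tool calls."""
    # Stage-1: decide whether the query is answerable from the initial context.
    decision = model("stage-1", context)
    if decision["mode"] == "single-turn":
        return decision["answer"]

    # Stage-2: call a tool, append its observation to the context, and repeat
    # until the model emits a final answer or the step budget runs out.
    for _ in range(max_steps):
        step = model("stage-2", context)
        if "answer" in step:
            return step["answer"]
        observation = tools[step["tool"]](step.get("arguments", {}))
        context = context + [f"{step['tool']}: {observation}"]
    raise RuntimeError("no final answer within max_steps")


# Stub model and tool so the sketch runs without any weights or video.
def stub_model(stage: str, context: List[str]) -> dict:
    if stage == "stage-1":
        return {"mode": "multi-turn"}
    if any(line.startswith("transcribe-speech:") for line in context):
        return {"answer": "The speaker says hello."}
    return {"tool": "transcribe-speech", "arguments": {"start": 0.0, "end": 5.0}}


stub_tools = {"transcribe-speech": lambda arguments: "hello"}
result = run_sage(stub_model, stub_tools, ["sampled frames + metadata"])
```

The `max_steps` cap is a design choice of this sketch, added so that a model that never produces an answer cannot loop forever.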
Supported Tools
The model is trained to generate JSON-formatted actions to invoke the following tools:
- web-search: Search the internet for external knowledge (e.g., sports standings, cast lists).
- transcribe-speech: Perform ASR on specific timestamped segments of the video.
- ground-event: Locate start/end timestamps for specific visual events.
- extract-video-parts: Extract high-resolution frames or subclips from specific timestamps.
- analyze: Perform detailed visual analysis on extracted media.
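As a rough illustration of what consuming such an action might look like, the sketch below parses a JSON string and checks the tool name against the five tools listed above. The `tool` and `arguments` field names, the `parse_action` helper, and the example payload are assumptions for this example; the model's actual action schema is defined by the runtime in the GitHub repo.

```python
import json
from typing import Tuple

# Tool names taken from the list above.
SUPPORTED_TOOLS = {
    "web-search",
    "transcribe-speech",
    "ground-event",
    "extract-video-parts",
    "analyze",
}


def parse_action(raw: str) -> Tuple[str, dict]:
    """Parse a JSON action string and validate it (hypothetical schema)."""
    action = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    name = action.get("tool")  # "tool" is an assumed field name
    if not isinstance(name, str) or name not in SUPPORTED_TOOLS:
        raise ValueError(f"unsupported tool: {name!r}")
    arguments = action.get("arguments", {})  # "arguments" is an assumed field name
    if not isinstance(arguments, dict):
        raise ValueError("arguments must be a JSON object")
    return name, arguments


# Example payload, invented for illustration.
example = '{"tool": "ground-event", "arguments": {"query": "goal scored"}}'
tool_name, tool_args = parse_action(example)
```

Rejecting unknown tool names before dispatch keeps a malformed or hallucinated action from reaching the tool executors.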
Usage
Note: SAGE-MM outputs JSON action strings. It requires a runtime environment (provided in our GitHub repo) to parse these strings, execute the tools, and feed the observation back to the model.
License
This model is licensed under Apache 2.0. It is intended for research and educational use in accordance with Ai2's Responsible Use Guidelines.