allenai/SAGE-MM-Qwen2.5-VL-7B-SFT_RL

Video-Text-to-Text


praeclarumjj3 committed 8 days ago

Commit 0c8f333 (verified) · 1 Parent(s): 9d2fad0

Update README.md

Files changed (1)

README.md +3 -15


@@ -14,18 +14,8 @@ pipeline_tag: video-text-to-text
 <img src="https://github.com/allenai/SAGE/blob/main/assets/sage.png" alt="SAGE Teaser" width="800"/>
 </div>
-## Model Details
-*   **Developed by:** SHI Labs @ Georgia Tech, Allen Institute for AI (AllenAI), University of Washington
-*   **Model Type:** Multimodal Agent Orchestrator / MLLM
-*   **Base Architectures:**
-    *   Qwen2.5-VL-7B-Instruct
-    *   Qwen3-VL-4B-Instruct
-    *   Qwen3-VL-8B-Instruct
-*   **Language(s):** English
-*   **License:** Apache 2.0 (Subject to base model license constraints)
-*   **Paper:** [SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning](https://arxiv.org/abs/25xx.xxxxx)
-*   **Code Repository:** [https://github.com/allenai/SAGE](https://github.com/allenai/SAGE)
+*   **GitHub Repo:** [https://github.com/allenai/SAGE](https://github.com/allenai/SAGE)
+*   **Project Page:** [https://praeclarumjj3.github.io/sage/](https://praeclarumjj3.github.io/sage/)
 ## System Capabilities
@@ -42,9 +32,7 @@ The model is trained to generate JSON-formatted actions to invoke the following
 *   `ground-event`: Locate start/end timestamps for specific visual events.
 *   `extract-video-parts`: Extract high-resolution frames or subclips from specific timestamps.
 *   `analyze`: Perform detailed visual analysis on extracted media.
-*   **Long Video Expert:** Achieves an **8.2% improvement** on videos longer than 10 minutes compared to direct inference.
-*   **Efficiency:** Despite being agentic, the inference runtime is roughly **8.6s/sample**, comparable to standard VLMs processing 512 frames, but with significantly higher accuracy.
 ## Usage
 **Note:** SAGE-MM outputs JSON action strings. It requires a runtime environment (provided in our [GitHub repo](https://github.com/allenai/SAGE)) to parse these strings, execute the tools, and feed the observation back to the model.
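
The parse-execute-observe step described in the note can be sketched roughly as follows. This is a minimal illustration, not the runtime from the GitHub repo: the action schema (`{"tool": ..., "arguments": {...}}`), the `dispatch_action` helper, and the stub tools are all assumptions made for the example. Only the three tool names come from the README.

```python
import json
from typing import Any, Callable, Dict

# Assumed action schema (hypothetical, not taken from the SAGE repo):
#   {"tool": "<tool-name>", "arguments": {...}}
Tool = Callable[[Dict[str, Any]], str]


def dispatch_action(action_str: str, tools: Dict[str, Tool]) -> str:
    """Parse one JSON action string, run the named tool, return an observation."""
    try:
        action = json.loads(action_str)
    except json.JSONDecodeError as exc:
        # Malformed output is reported back as an observation instead of raising.
        return f"error: invalid JSON action ({exc.msg})"
    if not isinstance(action, dict):
        return "error: action must be a JSON object"
    name = action.get("tool")
    if not isinstance(name, str) or name not in tools:
        return f"error: unknown tool {name!r}"
    return tools[name](action.get("arguments", {}))


# Stub tools standing in for the real ones; they only echo their arguments.
TOOLS: Dict[str, Tool] = {
    "ground-event": lambda args: f"ground-event called with {args}",
    "extract-video-parts": lambda args: f"extract-video-parts called with {args}",
    "analyze": lambda args: f"analyze called with {args}",
}

if __name__ == "__main__":
    model_output = '{"tool": "ground-event", "arguments": {"event": "goal"}}'
    # In a real loop, this observation would be appended to the model's context.
    print(dispatch_action(model_output, TOOLS))
```

Returning errors as observation strings, rather than raising, is one possible design choice: it lets the model see its own malformed action and retry on the next turn.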