MolmoPoint: Open-Source Multimodal Models with Precise Visual Grounding

MolmoPoint is a new open-source multimodal model from the Allen Institute for AI (AI2) that adds pointing and clicking capabilities.
It uses a novel architecture that maps visual coordinates to text tokens, allowing the model to interact with user interfaces and physical environments with high precision.
The model outperforms several proprietary counterparts in tasks requiring spatial awareness and fine-grained visual grounding.
AI2 has released the model weights, training data, and evaluation benchmarks to promote transparency and open research in the AI community.
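
To make the idea of mapping visual coordinates to text tokens concrete, here is a minimal sketch of one generic approach: quantizing a pixel coordinate into discrete bins and rendering each bin as a token string. This is an illustration only and is not taken from the MolmoPoint release. The token format (`<x_N>`, `<y_N>`), the bin count, and the function names are all assumptions made for this example.

```python
# Hypothetical coordinate-to-token scheme, NOT MolmoPoint's actual method.
# The "<x_N>" / "<y_N>" token format and the default bin count are assumptions.

def point_to_tokens(x, y, width, height, bins=1000):
    """Quantize a pixel coordinate into two discrete coordinate tokens."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("point lies outside the image")
    # Normalize to [0, 1), then scale to an integer bin index.
    bx = min(int(x / width * bins), bins - 1)
    by = min(int(y / height * bins), bins - 1)
    return [f"<x_{bx}>", f"<y_{by}>"]


def tokens_to_point(tokens, width, height, bins=1000):
    """Decode two coordinate tokens back to a pixel position (bin centre)."""
    bx = int(tokens[0][len("<x_"):-1])
    by = int(tokens[1][len("<y_"):-1])
    # Use the centre of each bin so the decoding error is at most half a bin.
    return ((bx + 0.5) / bins * width, (by + 0.5) / bins * height)


if __name__ == "__main__":
    toks = point_to_tokens(320, 240, width=640, height=480, bins=100)
    print(toks)
    print(tokens_to_point(toks, width=640, height=480, bins=100))
```

In a scheme like this, precision is bounded by the bin size: with 100 bins on a 640-pixel-wide image, each token covers 6.4 pixels, so a finer vocabulary trades a larger token set for tighter localization.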

Entities: Allen Institute for AI (AI2), Molmo, MolmoPoint
