Date
Publisher
arXiv
Interactive and spatially aware technologies are transforming educational
frameworks, particularly in K-12 settings where hands-on exploration fosters
deeper conceptual understanding. However, during collaborative tasks, existing
systems often lack the ability to accurately capture real-world interactions
between students and physical objects. This issue could be addressed with
automatic 6D pose estimation, i.e., estimation of an object's position and
orientation in 3D space from RGB images or videos. For collaborative groups
that interact with physical objects, 6D pose estimates allow AI systems to
relate objects and entities. As part of this work, we introduce FiboSB, a novel
and challenging 6D pose video dataset featuring groups of three participants
solving an interactive task featuring small hand-held cubes and a weight scale.
This setup poses unique challenges for 6D pose because groups are holistically
recorded from a distance in order to capture all participants -- this, coupled
with the small size of the cubes, makes 6D pose estimation inherently
non-trivial. We evaluated four state-of-the-art 6D pose estimation methods on
FiboSB, exposing the limitations of current algorithms on collaborative group
work. An error analysis of these methods reveals that the 6D pose methods'
object detection modules fail. We address this by fine-tuning YOLO11-x for
FiboSB, achieving an overall mAP_50 of 0.898. The dataset, benchmark results,
and analysis of YOLO11-x errors presented here lay the groundwork for
leveraging the estimation of 6D poses in difficult collaborative contexts.
What is the application?
Study design
