Learning In Focus: Detecting Behavioral And Collaborative Engagement Using Vision Transformers

Authors
Sindhuja Penchala,
Saketh Reddy Kontham,
Prachi Bhattacharjee,
Nima Mahmoodi,
Daniel Fonseca,
Sareh Karami,
Mehdi Ghahremani,
Andy D. Perkins,
Shahram Rahimi,
Noorbakhsh Amiri Golilarz
Publisher
arXiv
In early childhood education, accurately detecting collaborative and behavioral engagement is essential to fostering meaningful learning experiences. This paper presents an AI-driven approach that leverages Vision Transformers (ViTs) to automatically classify children's engagement using visual cues such as gaze direction, interaction, and peer collaboration. Using the ChildPlay gaze dataset, the method is trained on annotated video segments to classify behavioral and collaborative engagement states (e.g., engaged, not engaged, collaborative, not collaborative). Six state-of-the-art transformer models are evaluated: Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), Swin Transformer, ViTGaze, APViT, and GazeTR. Among these, the Swin Transformer achieved the highest classification performance with an accuracy of 97.58%, demonstrating its effectiveness in modeling both local and global attention. The results highlight the potential of transformer-based architectures for scalable, automated engagement analysis in real-world educational settings.
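The abstract attributes the Swin Transformer's strong result to its ability to model local and global attention. The sketch below illustrates the core windowed self-attention idea in plain NumPy: tokens attend only to other tokens within the same window, which Swin then complements with shifted windows across layers to mix information globally. This is a simplified single-head illustration with no learned projections or window shifting, not the paper's implementation or the actual Swin architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(tokens, window_size):
    """Swin-style windowed self-attention over a 1-D sequence of patch
    tokens (single head, identity Q/K/V projections -- an illustrative
    simplification, not the real Swin block).

    tokens: array of shape (num_tokens, dim);
    num_tokens must be divisible by window_size.
    """
    n, d = tokens.shape
    assert n % window_size == 0, "sequence must tile evenly into windows"
    windows = tokens.reshape(n // window_size, window_size, d)
    out = np.empty_like(windows)
    for i, w in enumerate(windows):
        scores = w @ w.T / np.sqrt(d)   # pairwise similarity within window
        out[i] = softmax(scores) @ w    # attend only inside this window
    return out.reshape(n, d)

# Example: 8 patch tokens of dimension 4, attended in windows of 4.
patches = np.random.randn(8, 4)
mixed = window_attention(patches, window_size=4)
print(mixed.shape)  # (8, 4)
```

Restricting attention to windows makes the cost linear in the number of tokens (quadratic only in the fixed window size), which is why hierarchical models like Swin scale to high-resolution classroom video frames better than global-attention ViTs.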