I am experimenting with the V-JEPA model developed by Meta for video understanding.
My goal is to analyze a live video stream of people attending a seminar and determine their engagement level (for example: bored, attentive, or interested).
I would like to know whether V-JEPA can be used directly for this kind of emotion or behavior analysis, or whether it only serves as a feature extractor that needs an additional classifier (e.g. a linear or attentive probe) trained on top of its embeddings.
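In case the answer is "feature extractor plus classifier," here is the rough setup I have in mind: freeze the encoder, pool its output into one embedding per clip, and train a small softmax probe over the three engagement classes. The embeddings below are synthetic placeholders (I haven't wired up the actual V-JEPA encoder yet), so this is only a sketch of the probe, not of the model itself:

```python
import numpy as np

# Placeholder for per-clip V-JEPA embeddings. In the real pipeline these
# would be, e.g., mean-pooled patch tokens from the frozen encoder.
rng = np.random.default_rng(0)
n_per_class, dim, n_classes = 50, 32, 3  # classes: bored / attentive / interested
centers = rng.normal(size=(n_classes, dim))
X = np.vstack([centers[c] + 0.5 * rng.normal(size=(n_per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# Linear probe: softmax regression trained on the frozen features.
W = np.zeros((dim, n_classes))
b = np.zeros(n_classes)
lr = 0.1
for _ in range(300):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    grad = p - np.eye(n_classes)[y]               # dL/dlogits for cross-entropy
    W -= lr * X.T @ grad / len(X)
    b -= lr * grad.mean(axis=0)

pred = (X @ W + b).argmax(axis=1)
acc = (pred == y).mean()
print(f"linear-probe train accuracy: {acc:.2f}")
```

The probe itself is cheap to train; the open question for me is whether V-JEPA's self-supervised video features carry enough signal about facial expression and body posture for engagement classes to be separable this way.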
Has anyone tried using V-JEPA or similar JEPA models for audience engagement or emotion analysis from video?