Efficient data-driven behavior identification based on vision transformers for human activity understanding