Integrating Vision Transformer-Based Bilinear Pooling and Attention Network Fusion of RGB and Skeleton Features for Human Action Recognition