E2-VOR: An End-to-End En/Decoder Architecture for Efficient Video Object Recognition