Constructing a speech audio–video corpus by aligning long segments of speech and text