Cross-Modal Matching of Text, Image and Symbolic Music Data