Counting competing speakers in a timeframe - human versus computer

We propose an automated method for computing the number of simultaneously active speakers within a timeframe. The method is studied in parallel with a perception experiment in which 28 volunteers were asked to detect how many speakers talk simultaneously in several recordings of variable length. For the perception experiment, we focus on how listening time and the use of familiar voices in the recordings affect the correct detection ratio. For the automated method, we discuss the influence of noise and how the detection error evolves with speech duration. We observe that, when capturing clean speech sources, the method is 76% accurate even for 10 simultaneous speakers, given speech segments longer than 3.5 seconds. The volunteers did not consistently detect more than 4 competing speakers correctly, even when listening for up to 80 seconds.