A Study on Classification in Imbalanced and Partially-Labelled Data Streams

The domain of radio astronomy is currently facing significant computational challenges, foremost amongst which are those posed by the development of the world's largest radio telescope, the Square Kilometre Array (SKA). Preliminary specifications for this instrument suggest that the final design will incorporate between 2000 and 3000 individual 15-metre receiving dishes, which together can be expected to produce a data rate of many TB/s. Given such a high data rate, it becomes crucial to consider how this information will be processed and stored to maximise its scientific utility. In this paper, we consider one possible data processing scenario for the SKA: an all-sky pulsar survey. In particular, we treat the selection of promising signals from the SKA processing pipeline as a data stream classification problem. We consider the feasibility of classifying signals that arrive via an unlabelled and heavily class-imbalanced data stream, using currently available algorithms and frameworks. Our results indicate that existing stream learners exhibit unacceptably low recall on real astronomical data when used in their standard configuration. However, their good false positive performance and accuracy comparable to that of static learners suggest that they have definite potential as an on-line solution to this particular big data challenge.
