Text Classification Using Mahout

After successful completion of this module, students will be able to:
a. explain and apply methods of text classification
b. correctly classify a set of documents using Apache Mahout
c. construct and apply workflows for text classification using Apache Mahout

4. 5S characteristics of the module
a) Streams: Streams include the data and the different types of results, such as a text file providing classification information.
b) Structures: Sample document collections distributed with Apache Mahout are in CSV format. Other formats can be used after preprocessing.
c) Spaces: The indexed documents and queries are logically represented as vectors in a vector space. The document collections are physically stored on the server running LucidWorks.
d) Scenarios: Scenarios include users classifying a collection of documents into different classes. As an example, a collection of emails can be classified as spam or non-spam.
e) Society: This module can be used by those who intend to classify a collection of documents. Potential end users are researchers, librarians, or anyone who wishes to learn Apache Mahout for text classification.

5. Level of effort required
This module should take about 4-5 hours to complete.
a. Out-of-class: 4-5 hours
b. In-class: Students may ask questions and discuss exercises with their teammates.

6. Relationships with other modules (flow between modules)
This module is not directly related to any previous module. The classification results from Apache Mahout can be used in other existing tools (e.g., NLTK) for information retrieval and analysis.
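
Illustrative sketch: the spam/non-spam scenario and the vector-space representation described under the 5S characteristics can be made concrete with a small example. The Python code below is not Apache Mahout and does not use its API. It is a minimal multinomial Naive Bayes classifier with add-one smoothing, assuming whitespace tokenization and a made-up four-document training set. The function names (tokenize, train, classify) and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately simple tokenizer."""
    return text.lower().split()


def train(labeled_docs):
    """Fit a multinomial Naive Bayes model from (text, label) pairs."""
    doc_counts = Counter()               # number of training documents per class
    term_counts = defaultdict(Counter)   # class -> term-frequency vector
    vocabulary = set()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        tokens = tokenize(text)
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return doc_counts, term_counts, vocabulary


def classify(model, text):
    """Return the class with the highest log-posterior score for the text."""
    doc_counts, term_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        # Start from the log prior of the class.
        score = math.log(doc_counts[label] / total_docs)
        total_terms = sum(term_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # ignore terms never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = term_counts[label][token]
            score += math.log((count + 1) / (total_terms + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set, made up for illustration only.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap loans win big", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("project report attached", "non-spam"),
]

model = train(TRAINING)
prediction = classify(model, "win a cash prize")
print(prediction)
```

Each class is summarized by a term-frequency vector, and a new document is scored against every class. A real collection would need far more training data and a held-out test set to measure accuracy.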
