A readability classification model is mainly used as an integrated part of an information retrieval system. By matching the readability level a user asks for to documents at that level, the model can further improve the results of, for example, a search engine. This thesis presents a new solution for classifying Swedish texts into readability levels. The results of the thesis are a number of classification models. The models were induced by training a Support Vector Machine classifier on features that previous research has established as good measures of readability. The features were extracted from a corpus annotated with three readability levels. Natural Language Processing tools for tagging and parsing were used to analyze the corpus and enable the feature extraction. Different feature combinations were tested empirically to optimize the classification model. The resulting models give good and stable classification. The best model obtained a precision of 90.21% and a recall of 89.56% on the test set, which corresponds to an F-score of 89.88%.
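The reported F-score can be checked from the precision and recall figures. The sketch below assumes the balanced F-score (the harmonic mean of precision and recall, often called F1); the abstract does not say which variant was used, and the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall.

    Assumption: the thesis reports the balanced (beta = 1) variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported for the best model on the test set.
best = f_score(0.9021, 0.8956)
print(f"{best * 100:.2f}")  # prints 89.88
```

Under that assumption the computed value agrees with the reported 89.88.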