Data advance preparation factors affecting results of sequence rule analysis in web log mining
One of the main tasks of web log mining is discovering patterns of behaviour of portal visitors.
Based on the discovered patterns of user behaviour, which are represented by sequence rules, it is
possible to modify and improve an organisation's web pages. This article aims to determine
experimentally to what degree data preparation is necessary for web log mining, and to specify the
steps that are indispensable for obtaining valid data from the log file. The results of the
experiment are very important for a portal that is regularly analysed and modified, since they can
confirm the correctness of individual steps of the analysis or, by identifying “useless” steps,
simplify the advance preparation of data. These results show that data
cleaning to remove crawler accesses has a significant impact on the quantity of extracted rules only
when the path completion method is used. By contrast, neither an impact on reducing the portion of
inexplicable rules nor an impact on the quality of extracted rules, in terms of their basic
characteristics, was proved. Path completion proved crucial in data preparation for web log mining:
it has a significant impact both on the
quantity and the quality of extracted rules. However, taking the browser used into account when
identifying sessions has no significant impact on either the quantity or the quality of
extracted rules. Several models exist for identifying user sessions, which are crucial in data
preparation; however, there is also a method that identifies them explicitly. Our
next goal is to add this functionality to the existing system and to analyse various parameters of
the individual session identification methods in comparison with the reference direct
identification. The article also notes the need to pay attention to the analysis of web logs in real
time, to reduce the time needed for the advance preparation of these logs and, at the same time,
to increase the accuracy of these data depending on the time of their collection.
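
The three preparation factors examined above (cleaning out crawler accesses, identifying sessions with or without the browser, and path completion) can be illustrated with a minimal Python sketch. It is not the procedure used in the experiment, which the abstract does not specify. The `Hit` record, the `CRAWLER_MARKERS` list, the 30-minute `timeout` and the referrer-based backtracking heuristic are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Hit:
    """One parsed log entry (hypothetical structure, not from the article)."""
    ip: str
    agent: str     # User-Agent string
    time: float    # seconds since an arbitrary epoch
    url: str
    referrer: str  # "" when the log has no referrer


# Assumed markers; a real crawler list would be much longer.
CRAWLER_MARKERS = ("bot", "crawler", "spider")


def remove_crawlers(hits: List[Hit]) -> List[Hit]:
    """Drop every hit from a visitor that fetched /robots.txt or whose agent looks like a crawler."""
    crawlers = {
        (h.ip, h.agent)
        for h in hits
        if h.url == "/robots.txt" or any(m in h.agent.lower() for m in CRAWLER_MARKERS)
    }
    return [h for h in hits if (h.ip, h.agent) not in crawlers]


def identify_sessions(hits: List[Hit], use_browser: bool = True,
                      timeout: float = 1800.0) -> List[List[Hit]]:
    """Group hits by IP (and optionally browser); a gap longer than `timeout` starts a new session."""
    sessions: List[List[Hit]] = []
    current: Dict[Tuple[str, str], int] = {}   # visitor key -> index of its open session
    last_seen: Dict[Tuple[str, str], float] = {}
    for h in sorted(hits, key=lambda x: x.time):
        key = (h.ip, h.agent if use_browser else "")
        if key not in last_seen or h.time - last_seen[key] > timeout:
            sessions.append([])
            current[key] = len(sessions) - 1
        sessions[current[key]].append(h)
        last_seen[key] = h.time
    return sessions


def complete_path(session: List[Hit]) -> List[str]:
    """Re-insert 'back' steps served from the browser cache, inferred from the referrer."""
    path: List[str] = []
    for h in session:
        if path and h.referrer and h.referrer != path[-1] and h.referrer in path:
            # Index of the most recent visit to the referrer page.
            j = len(path) - 1 - path[::-1].index(h.referrer)
            # Walk back from the page before the current one down to the referrer.
            path.extend(path[j:-1][::-1])
        path.append(h.url)
    return path
```

Switching `use_browser` between `True` and `False` corresponds to the comparison of session identification with and without the browser; the sequences returned by `complete_path` would then be passed to whatever sequence rule miner is in use.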