Data advance preparation factors affecting results of sequence rule analysis in web log mining

One of the main tasks of web log mining is discovering patterns of behaviour of portal visitors. Based on the found patterns of users behaviour, which are represented by sequence rules it is possible to modify and improve the web page of an organisation. This article aims at finding out by means of an experiment to what degree it is necessary to realize data preparation for web log mi- ning and it aims also at specifying inevitable steps for obtaining valid data from the log file. Results of the experiment are very important for the portal, which is regularly analysed and modified, since they can prove correctness of individual steps at analysis, or through an identification of “usele- ss” steps they can make the advance preparation of data simpler. These results show that data cleaning from crawlers accesses has a significant impact on the quantity of extracted rules only in case, when we use the method of paths completion. On the contrary, the impact on the reduction of the portion of inexplicable rules as well as the impact on the quality of extracted rules in terms of their basic characteristics was not proved. Paths completing was proved crucial in data prepa- ration for web log mining. It was proved that paths completing has a significant impact both on the quantity and the quality of extracted rules. However, it was prov ed that allowing the used browser upon identifying sessions has neither any significant impact on the quantity nor on the quality of extracted rules. There exist a number of models for identification of users sessions, which are cru- cial in data preparation, however, there e xists also a method, which identifies them expressly. Our next goal is to additionally programme this functionality into the existing system and analyse various parameters of individual methods of identification of sessions compared with the reference direct identification. It also mentions the necessity to pay attention to the analysis of web logs in the real time and to reduce the time needed for the advance preparation of these logs and at the same time to increase accuracy of these data depending on the time of their collection.