Web-Log Preparation with WUMprep
Project Overview
WUMprep is a collection of Perl scripts that support the preparation of Web server log files for data mining. Its primary purpose is to be used in conjunction with the Web usage miner WUM, but WUMprep can also be used standalone or together with other tools for Web log analysis.
What WUMprep can do for you
Typically, preparing Web server log files for mining with WUM requires the following steps:
- Conversion of the log file into the "extended cookie" format
- Removal of irrelevant requests
- Removal of duplicate requests
- Optional: Try to resolve host IP addresses into hostnames
- Definition of sessions
- Removal of robot requests
- Map URLs onto abstract concept labels (conceptual scaling)
- Application of specific data preparation
Each of these steps is supported by one or more Perl scripts. Every script has its own inline documentation, which explains its usage and the underlying algorithms in greater detail. The documentation can be accessed by invoking the command perldoc script.pl on the command line, where script.pl is replaced with the Perl script's file name. (Please note that you have to specify the script's complete path if the script directory is not contained in the PATH environment variable.)
All options and parameters for the WUMprep scripts are stored in a file called wumprep.conf. An example of this file is included in the directory containing the WUMprep Perl scripts. This template is well documented and should be self-explanatory. The configuration file is expected to reside in the directory containing the log files to be processed.
Remove irrelevant requests
The idea behind the WUM mining model is to analyze usage patterns. For this purpose, we are interested in the paths visitors take when traversing a Web site, as recorded in Web server log files. These log files contain not only requests for the pages comprising the Web site, but also requests for images, scripts etc. embedded in these pages. These "secondary" requests are not needed for the analysis and are thus irrelevant; they must be removed from the logs before mining.
The script logFilter.pl is designed to perform this data cleaning task.
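The filtering idea can be illustrated with a minimal Python sketch. It is not the code of logFilter.pl; the extension list, the regular expression and the function names are assumptions made for this example, and the input is assumed to be in Common Logfile Format.

```python
import re
from urllib.parse import urlsplit

# Hypothetical set of "secondary" resource extensions; adjust it to the site.
SECONDARY_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".ico", ".css", ".js"}

# Matches the quoted request field, e.g. "GET /index.html HTTP/1.0",
# and captures the requested URL.
REQUEST_RE = re.compile(r'"(?:[A-Z]+) (\S+)(?: [^"]*)?"')

def is_relevant(log_line):
    """Return True unless the requested path ends in a secondary extension."""
    match = REQUEST_RE.search(log_line)
    if match is None:
        return False  # this sketch simply drops lines it cannot parse
    # Ignore the query string and compare case-insensitively.
    path = urlsplit(match.group(1)).path.lower()
    return not any(path.endswith(ext) for ext in SECONDARY_EXTENSIONS)

def filter_log(lines):
    """Keep only the log lines that request a page rather than an embedded resource."""
    return [line for line in lines if is_relevant(line)]
```

Filtering on the file extension is only one possible criterion; a site that serves pages without extensions would need path-based rules instead.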
Remove duplicate requests
If a network connection is slow or a server's response time is long, a visitor might click the same link several times in succession before the requested page is finally shown in the browser. Those duplicate requests are noise in the data and should be removed.
This is the second job of the script logFilter.pl. It detects such duplicates in the log and drops all but the first occurrence.
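One way to detect such duplicates is sketched below in Python. This is an illustration only, not the algorithm of logFilter.pl: the record layout (host, timestamp in seconds, URL), the function name and the 5-second window are assumptions.

```python
def drop_duplicates(requests, window=5):
    """Drop repeated requests for the same URL by the same host.

    `requests` is an iterable of (host, timestamp, url) tuples sorted by
    timestamp. A request counts as a duplicate if the same host asked for
    the same URL at most `window` seconds earlier.
    """
    last_seen = {}  # (host, url) -> timestamp of the most recent request
    kept = []
    for host, timestamp, url in requests:
        key = (host, url)
        previous = last_seen.get(key)
        # Update on every request so that a burst of repeated clicks
        # is collapsed into its first occurrence.
        last_seen[key] = timestamp
        if previous is not None and timestamp - previous <= window:
            continue
        kept.append((host, timestamp, url))
    return kept
```

Because the last-seen time is updated even for dropped requests, a long series of impatient clicks is reduced to a single request as long as each click follows the previous one within the window.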
Resolve hosts' IP addresses
Depending on the Web server configuration, either a host's IP address or its hostname is logged. For data preparation purposes, knowing the hostnames has some advantages over working with IP addresses. For example, many proxy servers of major internet service providers identify themselves as proxies in their hostnames. Those log entries could be removed to improve the accuracy of the data mining results when user identification relies on hostnames.
Most IP addresses can be resolved to hostnames with appropriate DNS queries. This job is done by the script dnsLookup.pl.
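A reverse lookup of this kind can be sketched in Python with the standard library call socket.gethostbyaddr. The sketch does not reproduce dnsLookup.pl; the function names, the cache and the assumption that the host is the first space-separated field of each log line are choices made for this example.

```python
import socket

def resolve_host(ip, cache, lookup=socket.gethostbyaddr):
    """Return the hostname for `ip`, or `ip` itself if it cannot be resolved."""
    if ip not in cache:
        try:
            # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
            cache[ip] = lookup(ip)[0]
        except OSError:
            # socket.herror and socket.gaierror are subclasses of OSError;
            # keep the address unchanged when there is no DNS record.
            cache[ip] = ip
    return cache[ip]

def resolve_log(lines, lookup=socket.gethostbyaddr):
    """Replace the leading host field of each log line with its hostname."""
    cache = {}  # each address is looked up only once
    resolved = []
    for line in lines:
        host, separator, rest = line.partition(" ")
        resolved.append(resolve_host(host, cache, lookup) + separator + rest)
    return resolved
```

The `lookup` parameter exists only so that the sketch can be tried without network access; caching matters in practice because the same address usually appears in many log lines and DNS queries are slow.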
Define sessions
For further data preparation and data mining tasks, it is necessary to divide log files into user sessions. A session is a contiguous series of requests from a single host. Multiple sessions of the same host can be separated by setting a maximum page view time for a single page, by using a user/session identification cookie, or by defining one or more pages as "session starters".
In the WUMprep suite, sessionize.pl is the script that supports this task. It prefixes each host field in the log with a session identifier. For details about the criteria used for session identification, please refer to the script's inline documentation.
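The first of these criteria, a maximum page view time, can be illustrated with a short Python sketch. It is not the implementation of sessionize.pl: the record layout (host, timestamp in seconds, URL), the 1800-second threshold and the "id|host" prefix format are assumptions made for this example.

```python
def sessionize(requests, max_page_view=1800):
    """Prefix each host with a session identifier.

    `requests` is an iterable of (host, timestamp, url) tuples sorted by
    timestamp. A host starts a new session when more than `max_page_view`
    seconds have passed since its previous request.
    """
    last_time = {}   # host -> timestamp of its previous request
    session_of = {}  # host -> identifier of its current session
    next_id = 0
    result = []
    for host, timestamp, url in requests:
        if host not in last_time or timestamp - last_time[host] > max_page_view:
            next_id += 1
            session_of[host] = next_id
        last_time[host] = timestamp
        # Hypothetical prefix format: "<session id>|<host>".
        result.append((f"{session_of[host]}|{host}", timestamp, url))
    return result
```

A timeout alone cannot separate different users behind the same proxy host, which is why cookie-based identification is preferable where the log contains such a cookie.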
Remove robot sessions
On many Web sites, a significant fraction of the requests stems from robots, indexers, spiders or agents. Since these requests are generated automatically, their traces in the log file do not represent human browsing behaviour and thus distort mining results.
Several heuristics exist to distinguish between human users and hosts that are robots. They are implemented in the script removeRobots.pl and described in the script's inline documentation.
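As an illustration, two widely used heuristics are sketched below in Python: a session that requests /robots.txt, or whose user agent string contains a typical robot marker, is treated as a robot session. These are assumptions for this example and not necessarily the heuristics implemented in removeRobots.pl; the marker list, data layout and function names are hypothetical.

```python
# Hypothetical substrings that often appear in robot user agent strings.
ROBOT_AGENT_MARKERS = ("bot", "crawler", "spider")

def is_robot_session(requests):
    """Return True if a session looks automated.

    `requests` is a list of (url, user_agent) tuples belonging to one session.
    """
    for url, user_agent in requests:
        if url == "/robots.txt":
            return True  # human visitors rarely request this file
        agent = user_agent.lower()
        if any(marker in agent for marker in ROBOT_AGENT_MARKERS):
            return True
    return False

def remove_robot_sessions(sessions):
    """Drop robot sessions from a dict mapping session id -> list of requests."""
    return {
        session_id: requests
        for session_id, requests in sessions.items()
        if not is_robot_session(requests)
    }
```

Both checks operate on whole sessions, which is why robot removal comes after session definition: a single tell-tale request is enough to discard every request of that session.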
Conceptual scaling
For many analysis techniques, working with the raw URLs from Web log files yields few valuable findings, if any. As with most online analytical processing (OLAP) tasks, it is necessary to abstract from the raw data to concepts from the domain of analysis that generalize and aggregate it.
In WUMprep, mapReTaxonomies.pl is the tool that solves this problem. The script applies matching rules based on regular expressions to the log file and replaces the URLs of requested pages with the appropriate concept labels you have defined in the rules file.
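The mapping principle can be sketched in Python as follows. The rules, labels and function name are invented for this example and do not reflect the rules file syntax of mapReTaxonomies.pl; the sketch also assumes that the first matching rule wins and that unmatched URLs are left unchanged.

```python
import re

# Hypothetical rules: (regular expression, concept label), checked in order.
RULES = [
    (re.compile(r"^/products/"), "PRODUCT"),
    (re.compile(r"^/(cart|checkout)"), "PURCHASE"),
    (re.compile(r"^/(index\.html)?$"), "HOME"),
]

def to_concept(url, rules=RULES):
    """Return the label of the first rule matching `url`, or `url` itself."""
    for pattern, label in rules:
        if pattern.search(url):
            return label
    return url  # no rule matched: keep the raw URL

def map_urls(urls, rules=RULES):
    """Replace every URL in a sequence with its concept label."""
    return [to_concept(url, rules) for url in urls]
```

For instance, under these rules a path through /index.html, /products/42.html and /checkout/pay becomes HOME, PRODUCT, PURCHASE, so that visits to different product pages are aggregated into the same pattern.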
Further data preparation tasks
The data preparation steps described so far can be viewed as "generic" ones that apply to most Web usage mining tasks. At this point, irrelevant or disturbing data have been removed and the logs are divided into single user sessions.
What follows is application-specific data preparation, for which WUMprep provides no generic algorithms. However, you might use WUMprep as a starting point for deriving your own customized log file preparation tools. And of course, your contributions to extending WUMprep are very welcome.
Download
Stable release
Get the current stable release from SourceForge.
Development release
The current development release is 0.11.0. WUMprep is currently being extended and partly rewritten in order to make it work with WUMprep4Weka. Although it should be possible in principle to maintain backward compatibility with WUMprep release 0.10.0, this is not tested at the moment and compatibility might temporarily break during the development process. Therefore, any bug fix releases of the 0.10.0 version will be published as 0.10.x.
Resources
- AccessLog.txt: Example log file in Common Logfile Format that you can use for your first steps with WUMprep. Please note: this log contains dummy URLs, so you cannot use it for testing the dnsLookup.pl script.
Selected publications
- Gebhard Dettmar. Logfile Preprocessing Using WUMprep. Talk given at the Web Mining Seminar in Winter semester 2003/04, School of Business and Economics, Humboldt University Berlin, Berlin, 2003. (PDF File)
- Gebhard Dettmar. Knowledge Discovery in Databases, Teil II - Web Mining. Online publication on Community of Knowledge, 2003-06-15. (In German)
- Gebhard Dettmar. Knowledge Discovery in Databases, Teil III: Konzept Hierarchien in WUMprep. Online publication on Community of Knowledge, 2004-04-02. (In German)
- Carsten Pohle and Myra Spiliopoulou. Building and Exploiting Ad Hoc Concept Hierarchies for Web Log Analysis. In: Data Warehousing and Knowledge Discovery, Proceedings of the 4th International Conference, DaWaK 2002, Aix-en-Provence, France, September 4-6, 2002, ed. by Y. Kambayashi, W. Winiwarter and M. Arikawa, Vol. 2454 of Lecture Notes in Computer Science, Berlin: Springer Verlag, 2002, pp. 83-93.

