WUMprep4Weka
News
- 2005-10-30: WUMprep4Weka-0.9-alpha3 released
  Improved usability of the WUMprep configuration editor.
- 2005-10-27: WUMprep4Weka-0.9-alpha2 released
  Most of the TODOs for release 0.9-beta1 are done. See the Changelog for details.
Overview
WUMprep4Weka (pronounced "WUMprep for Weka") integrates the WUMprep tools for Web log preparation into the Weka KnowledgeFlow graphical user interface. It offers an easy-to-use way to process Web logs while retaining full access to the powerful and efficient WUMprep Perl scripts. WUMprep can now be configured through graphical dialogues with context help, so manual editing of WUMprep configuration files is no longer necessary.
WUMprep4Weka is available both as a standalone plugin for Weka, which requires manual installation of Weka and WUMprep, and as a pre-packaged release of Weka (called "Weka-WP4W") that contains everything you need to use WUMprep4Weka except the Perl and Java runtime environments.
Features
The following Web mining preparation tasks are supported by the latest release:
- RequestFilter: Remove single log lines matching certain criteria, such as the path or extension of the requested file or the server response code.
- Anonymize: Anonymize a log file by replacing its original host addresses with random ones.
- DnsLookup: Resolve hosts' IP addresses into their DNS names.
- Sessionize: Apply heuristics or cookie data to group log lines into sessions.
- SessionFilter: Remove complete sessions according to different criteria.
- DetectRobots: Apply heuristics to classify requests as robot- or human-sourced.
- Conceptualize: Perform regular-expression-based conceptual scaling on the log by replacing request paths with abstract terms (i.e., an operation similar to a roll-up in a data warehouse).
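To make the idea of a sessionizing heuristic more concrete, here is a minimal Python sketch of one common approach, an inactivity timeout. It is not taken from the WUMprep Perl scripts; the function name `sessionize`, the `(host, timestamp, path)` record layout and the 30-minute threshold are assumptions chosen for illustration.

```python
from collections import defaultdict


def sessionize(requests, timeout=1800):
    """Group (host, timestamp, path) tuples into sessions per host.

    Assumed heuristic: a new session starts whenever the gap between two
    consecutive requests from the same host exceeds `timeout` seconds.
    """
    # Collect the requests of each host separately.
    by_host = defaultdict(list)
    for host, timestamp, path in requests:
        by_host[host].append((timestamp, path))

    sessions = []
    for host, hits in by_host.items():
        hits.sort()  # order each host's requests by timestamp
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > timeout:
                # Gap too large: close the running session, start a new one.
                sessions.append((host, current))
                current = []
            current.append(hit)
        sessions.append((host, current))
    return sessions
```

Called with a list of request tuples, the function returns a list of `(host, hits)` pairs, one per session. A cookie-based variant would group on a session cookie instead of the host address.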
Get Involved
The WUMprep4Weka project is currently looking for volunteers who would like to help lift it from its current alpha stage to the first stable release. Please check the Roadmap for a detailed list of what has to be done to get there.
Anybody who is a little familiar with both WUMprep and Weka can contribute. We will appreciate any reports about applications of WUMprep4Weka in your own (research) projects. You are also invited to contribute to the user documentation - just write down what you did in order to get your project done ;-).
In particular, we would welcome input on:
- Bug reports
- Improving the installation, particularly of the Perl modules required for WUMprep. Ideally, this should be automated in a setup routine covering at least the Windows, Linux and Mac platforms.
- User documentation, perhaps in the form of a project report.
- Usability improvements.
Please contact us whenever you have questions regarding WUMprep4Weka or want to contribute.
Screenshots
The following screenshots illustrate what WUMprep4Weka looks like and how easy it is (will be ;-)) to prepare your Web log files.
The Weka KnowledgeFlow interface with a WeblogLoader node for importing a Web log file into Weka.
Note how the WUMprep4Weka nodes are integrated into the node selector at the top of the screen.
The configuration dialogue of the WeblogLoader node. You can define the global WUMprep configuration settings and those required for importing a log file.
Note the help pane showing the documentation for the configuration option currently being edited.
A complete KnowledgeFlow that loads a Web log file, resolves its IP addresses into DNS host names, displays the results in the TextViewer and writes the processed log into a new file via the WeblogSaver node.
The configuration dialogue for the WeblogSaver node.
Roadmap
The following is a list of planned "milestone releases" of WUMprep4Weka. There will probably be further releases between these milestones that do not appear in this list.
Release 0.9-alpha1
Released: 2005-10-18
To-Do:
- Installation docs
- Packaging
Release 0.9-beta1
Scheduled: TBA
To-Do:
- Make WUMprep scripts ARFF-aware
  - logFilter.pl (now requestFilter.pl)
  - anonymize.pl
  - dnsLookup.pl
  - sessionize.pl
  - sessionFilter.pl
  - detectRobots.pl
  - mapReTaxonomies.pl (now conceptualize.pl)
- Create nodes for the KnowledgeFlow interface
  - RequestFilter
  - Anonymize
  - DnsLookup
  - Sessionize
  - SessionFilter
  - DetectRobots
  - Conceptualize
  - Vectorize (transform a sessionized log containing one record per line into session vectors)
- Write an editor GUI for taxonomy definition files and make it accessible from the mapReTaxonomies' configuration dialogue.
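The planned Vectorize node in the list above turns a sessionized log with one record per line into session vectors. The following Python sketch shows one possible reading of that transformation, using page-visit counts as vector components. The function name `vectorize`, the `(session_id, page)` record layout and the choice of counts are assumptions, not the project's design.

```python
from collections import Counter


def vectorize(records):
    """Turn (session_id, page) records into one count vector per session.

    Returns the sorted page vocabulary and a dict that maps each session
    id to a list of visit counts, aligned with that vocabulary.
    """
    # The vector dimensions are all distinct pages, in sorted order.
    pages = sorted({page for _, page in records})

    # Count how often each session requested each page.
    counts = {}
    for session_id, page in records:
        counts.setdefault(session_id, Counter())[page] += 1

    # A Counter yields 0 for pages a session never visited.
    vectors = {sid: [c[p] for p in pages] for sid, c in counts.items()}
    return pages, vectors
```

Each resulting vector could then be written out as one data row, with the page vocabulary supplying the attribute names.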
Release 1.0
Scheduled: TBA
This will be the first stable release.
To-Do:
- Rewrite conceptualize.pl so that it no longer replaces path data in the log file but returns the conceptual scaling results as additional attributes
- User documentation
- Test the beta release in real-world applications
- Bug-fixing
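The conceptualize.pl rewrite planned in the list above is meant to keep the original path and deliver the conceptual scaling result as an additional attribute. A minimal Python sketch of that idea follows. It does not reflect the actual Perl script or its taxonomy file format; the `TAXONOMY` patterns, the concept names and the `concept` attribute are hypothetical.

```python
import re

# Hypothetical taxonomy: (pattern, concept) pairs; the first match wins.
TAXONOMY = [
    (re.compile(r"^/products/"), "PRODUCT"),
    (re.compile(r"^/(cart|checkout)"), "PURCHASE"),
]


def conceptualize(record, taxonomy=TAXONOMY, default="OTHER"):
    """Return a copy of `record` with an added 'concept' attribute.

    The original 'path' value is kept unchanged; the abstract term is
    stored alongside it instead of replacing it.
    """
    path = record["path"]
    # Pick the concept of the first matching pattern, else the default.
    concept = next((c for rx, c in taxonomy if rx.search(path)), default)
    return {**record, "concept": concept}
```

Because the path is preserved, later steps can still filter or group on the raw request while also using the abstract term.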

