ODBCMine is a data-mining tool that generates decision trees. It analyzes the data in ODBC data sources using the C4.5 algorithm, and creates graphical decision trees in Scalable Vector Graphics (SVG) format. It is intended to be a simple and inexpensive -- yet powerful -- implementation of a classic machine learning/data-mining technique.
ODBCMine is distributed as shareware. You can download the evaluation version and use it for 30 days. This version limits the number of rows (15) and columns (4) in tables that can be analyzed; otherwise it is identical to the registered version.
Immediate Order Fulfillment!
If you register ODBCMine online at RegNet you will immediately be sent an email with the link to the unrestricted version.
Sale Price!
A single-user license is just $49.95.
The theory behind ODBCMine is described in detail in the online article Building Classification Models: ID3 and C4.5 from Temple University, and in the article The ID3 Algorithm. We have implemented the basic ID3 algorithm described in these articles as a Win32 executable, and added logic for some of the C4.5 extensions as well as for ODBC connectivity. We've also provided the sample data cases from the first article (the StockMarket and Golf examples) as an MS Access 97 database.
Some possible applications of ODBCMine include credit risk analysis and customer behavior prediction. An interesting discussion of business data-mining applications can be read online in the CIO Magazine Data Mining Article. We also note that an earlier version of ODBCMine is used as a tool in the ECT 584 Web Data Mining graduate-level class at DePaul University.
For example, assume that you have kept a record of all the days in which you have gone golfing in the past year. You've included a record of the weather conditions on that day (was it windy, what was the temperature, what was the general outlook for clouds, etc.), and you've also recorded whether or not you actually did play under those conditions. You have stored your data as a table in a relational database:
| Outlook | Windy | Humidity | Temperature | Play |
|---|---|---|---|---|
| sunny | false | 85 | 85 | no |
| sunny | true | 90 | 80 | no |
| overcast | false | 78 | 83 | yes |
| rain | false | 96 | 70 | yes |
| rain | false | 80 | 68 | yes |
| rain | true | 70 | 65 | no |
| overcast | true | 65 | 64 | yes |
| sunny | false | 95 | 72 | no |
| sunny | false | 70 | 69 | yes |
| rain | false | 80 | 75 | yes |
| sunny | true | 70 | 75 | yes |
| overcast | true | 90 | 72 | yes |
| overcast | false | 75 | 81 | yes |
| rain | true | 80 | 71 | no |
Using ODBCMine, you can analyze this data to construct a decision tree (or algorithm) for predicting whether or not you should play on future days. To run an analysis, you must specify the name of the database table to analyze, the name of the goal column (the table column that represents the final decision desired), and the names of one or more columns that are factors in making the decision. For our example, the name of the table is golf, the goal for the final decision is play, and the decision factor columns are outlook, windy, and humidity. Given this input, ODBCMine will generate a decision tree showing how a decision on whether to play can be made:
For each decision result, ODBCMine calculates a confidence factor. A factor of 100% indicates that the decision path should always be correct, based on the training set data. Lesser values imply errors; the factor is the percentage of values in the training set classified correctly by that path of the tree.
ODBCMine can also display the decision tree as a set of IF-THEN rules:
if outlook = overcast then play = yes (100%)
if outlook = rain and windy = false then play = yes (100%)
if outlook = rain and windy = true then play = no (100%)
if outlook = sunny and humidity <= 70 then play = yes (100%)
if outlook = sunny and humidity > 70 then play = no (100%)
As another example, consider a table of stock market profits based on the age of the company, whether the company has competition, and the market sector (hwr for hardware, swr for software):
| Age | Competition | Type | Profit |
|---|---|---|---|
| old | yes | swr | down |
| old | no | swr | down |
| old | no | hwr | down |
| mid | yes | swr | down |
| mid | yes | hwr | down |
| mid | no | hwr | up |
| mid | no | swr | up |
| new | yes | swr | up |
| new | no | hwr | up |
| new | no | swr | up |
Here is the resulting decision tree. This example shows the tree drawn in vertical orientation instead of the default horizontal orientation:
The C4.5 algorithm used in ODBCMine can analyze columns that have discrete values ('low','high', etc.) as well as numeric values with continuous ranges. However, analyzing continuous columns is significantly more expensive, and requires that ODBCMine issue a query for each distinct numeric value found in the table. Internally, ODBCMine will partition the continuous range into two discrete sets: less than or equal to a breakpoint, and greater than a breakpoint. ODBCMine will choose the breakpoint that gives the best decision tree possible.
You can manually convert continuous columns into discrete columns using the IIF() function provided by many databases (including MS Access). This function has the form:
IIF(expression,value-if-true,value-if-false)
Instead of using the column value directly, the DBMS will evaluate the expression, and if the expression is true, will substitute the "value-if-true" value. If the expression is false, the "value-if-false" is used. In the example above, you can partition the humidity column into two discrete values as follows:
IIF(humidity > 75,'humid','dry')
This will cause the analysis to use the value 'humid' for all humidity values greater than 75, and 'dry' for those less than or equal to 75. This is much more efficient and will significantly improve the performance of the analysis. We recommend using this technique for any continuous column where a good partitioning breakpoint is already known. An example command-line using this technique is:
odbcmine -f golf2.svg miningdb golf play outlook windy "IIF(humidity>75,'humid','dry')"
The resulting decision tree looks like:
Also, to create any number of distinct discrete values from a continuous range, you can nest the IIF expressions:
IIF(humidity>85,'humid',IIF(humidity>60,'moderate','dry'))
If the data being analyzed spans multiple tables, you can create a relational view that performs a query to join the data into one result set. Note that in MS Access, this is equivalent to creating a named query; for other databases, you can use the SQL CREATE VIEW statement. You can then specify the name of this view as the table to analyze.
ODBCMine writes decision trees in SVG format. We recommend downloading and installing the Adobe SVG Viewer plug-in for viewing SVG files using your internet browser (Internet Explorer or Netscape). More SVG software is listed at W3C SVG Implementations. You can convert SVG into PNG using the Batik SVG Rasterizer (requires a Java runtime):
java -jar batik-rasterizer.jar golf.svg
A common problem occurs when the size of the SVG tree is bigger than the display window. The Adobe SVG viewer does not provide scrolling facilities; as a workaround, you can reference the SVG file from another HTML file and launch your browser on the HTML file. This can be done via the HTML embed tag:
<embed type="image/svg+xml" src="odbcmine.svg" width="2000" height="400">
ODBCMine provides a command-line interface that allows analysis of any table or view within an ODBC data source. The resulting decision tree is written as an SVG file. Errors are written to stderr.
The format of the command line is:
odbcmine dsn table goal column [...column]
where:
'long data source name'
ODBCMine also accepts a number of optional arguments:
| -f fname | write output to named file. If this argument is not supplied, the decision tree is written to odbcmine.svg |
| -u uid | userid for ODBC datasource |
| -p pwd | password for ODBC datasource |
| -d | debug mode; writes each SQL statement as it is executed to the console |
| -r | display the decision tree as a set of IF-ELSE rules |
| -v | draw tree in vertical orientation (default is horizontal) |
Any of the arguments may be quoted, as in:
odbcmine -f "c:\\my directory\\output.svg" miningdb golf play humidity
For example, to generate rules using the golfing example described in the article above, issue the following command at the command line (assuming that you have set up the sample miningdb.mdb database as an ODBC data source named miningdb). The output will be written to odbcmine.svg:
odbcmine miningdb golf play outlook humidity windy
ODBCMine is a 32-bit Windows executable, suitable for use on Windows 98, Windows NT, Windows 2000, or Windows XP. The distribution file in .ZIP format is available at odbcmine.zip and contains the following files:
You can download the Adobe SVG Viewer here:
To install ODBCMine, unpack the distribution ZIP file into
a directory on your hard drive (normally c:\ODBCMine),
and place the name of the directory into your PATH variable.
To run the sample cases in golf.bat and stock.bat, you must have the MS Access ODBC Driver (not included). Also, you must define an ODBC data source named miningdb and point it to the supplied miningdb.mdb file. You should refer to the ODBC documentation under the MS Windows control panel for more information about setting up ODBC data sources.
We are planning future enhancements to ODBCMine to:
ODBCMine is distributed as shareware. The evaluation version limits the number of records to be analyzed to 15, and the number of columns to 4. The registered version removes these limitations. You may try it freely for a period of 30 days, after which you must register it. The following commercial licenses are available:
| License | Price | Description |
|---|---|---|
| ODBCMine Single-User License | $49.95 | This license includes use on a single machine by a single user. |
Registration provides up to 1 hour of free technical support. Sorry, we cannot provide support to unregistered users. Contact us for site licenses and other support plans. Academic users may obtain and use the registered version for free via special request; please email us with your name, address, and course details, and we will provide download instructions.
ODBCMine is Copyright © 2003 Intelligent Systems Research LLC. All Rights Reserved. Your use of the program indicates acceptance of the License Agreement with Intelligent Systems Research LLC.
Questions? Comments? Contact us at info@intsysr.com