Decision Tree Analysis Using ODBCMine

ODBCMine is a data-mining tool that generates decision trees. It analyzes the data in ODBC data sources using the C4.5 algorithm, and creates graphical decision trees in Scalable Vector Graphics (SVG) format. It is intended to be a simple and inexpensive -- yet powerful -- implementation of a classic machine learning/data-mining technique.

ODBCMine is distributed as shareware. You can download the evaluation version and use it for 30 days. This version limits the number of rows (15) and columns (4) in tables that can be analyzed; otherwise it is identical to the registered version.

Immediate Order Fulfillment!
If you register ODBCMine online at RegNet you will immediately be sent an email with the link to the unrestricted version.

Sale Price!
A single-user license is just $49.95.

The theory behind ODBCMine is described in detail in the online article Building Classification Models: ID3 and C4.5 from Temple University, and in the article The ID3 Algorithm. We have implemented the basic ID3 algorithm described in these articles as a Win32 executable, and added logic for some of the C4.5 extensions as well as for ODBC connectivity. We've also provided the sample data cases from the first article (the StockMarket and Golf examples) as an MS Access 97 database.

Some possible applications of ODBCMine include credit risk analysis and customer behavior prediction. An interesting discussion of business data-mining applications can be read online in the CIO Magazine Data Mining Article. We also note that an earlier version of ODBCMine is used as a tool in the ECT 584 Web Data Mining graduate-level class at DePaul University.

Example

For example, assume that you have kept a record of all the days in which you have gone golfing in the past year. You've included a record of the weather conditions on that day (was it windy, what was the temperature, what was the general outlook for clouds, etc.), and you've also recorded whether or not you actually did play under those conditions. You have stored your data as a table in a relational database:

Outlook Windy Humidity Temperature Play
sunny false 85 85 no
sunny true 90 80 no
overcast false 78 83 yes
rain false 96 70 yes
rain false 80 68 yes
rain true 70 65 no
overcast true 65 64 yes
sunny false 95 72 no
sunny false 70 69 yes
rain false 80 75 yes
sunny true 70 75 yes
overcast true 90 72 yes
overcast false 75 81 yes
rain true 80 71 no

Using ODBCMine, you can analyze this data to construct a decision tree (or algorithm) for predicting whether or not you should play on future days. To run an analysis, you must specify the name of the database table to analyze, the name of the goal column (the table column that represents the final decision desired), and the names of one or more columns that are factors in making the decision. For our example, the name of the table is golf, the goal for the final decision is play, and the decision factor columns are outlook, windy, and humidity. Given this input, ODBCMine will generate a decision tree showing how a decision on whether to play can be made:

For each decision result, ODBCMine calculates a confidence factor. A factor of 100% indicates that the decision path should always be correct, based on the training set data. Lesser values imply errors; the factor is the percentage of values in the training set classified correctly by that path of the tree.

ODBCMine can also display the decision tree as a set of IF-THEN rules:

if outlook = overcast then play = yes (100%)

if outlook = rain and windy = false then play = yes (100%)

if outlook = rain and windy = true then play = no (100%)

if outlook = sunny and humidity <= 70 then play = yes (100%)

if outlook = sunny and humidity > 70 then play = no (100%)

As another example, consider a table of stock market profits based on the age of the company, whether the company has competition, and the market sector (hwr for hardware, swr for software):

Age Competition Type Profit
old yes swr down
old no swr down
old no hwr down
mid yes swr down
mid yes hwr down
mid no hwr up
mid no swr up
new yes swr up
new no hwr up
new no swr up

Here is the resulting decision tree. This example shows the tree drawn in vertical orientation instead of the default horizontal orientation:

Discrete vs. Continuous Columns

The C4.5 algorithm used in ODBCMine can analyze columns that have discrete values ('low','high', etc.) as well as numeric values with continuous ranges. However, analyzing continuous columns is significantly more expensive, and requires that ODBCMine issue a query for each distinct numeric value found in the table. Internally, ODBCMine will partition the continuous range into two discrete sets: less than or equal to a breakpoint, and greater than a breakpoint. ODBCMine will choose the breakpoint that gives the best decision tree possible.

You can manually convert continuous columns into discrete columns using the IIF() function provided by many databases (including MS Access). This function has the form:

IIF(expression,value-if-true,value-if-false)

Instead of using the column value directly, the DBMS will evaluate the expression, and if the expression is true, will substitute the "value-if-true" value. If the expression is false, the "value-if-false" is used. In the example above, you can partition the humidity column into two discrete values as follows:

IIF(humidity > 75,'humid','dry')

This will cause the analysis to use the value 'humid' for all humidity values greater than 75, and 'dry' for those less than or equal to 75. This is much more efficient and will significantly improve the performance of the analysis. We recommend using this technique for any continuous column where a good partitioning breakpoint is already known. An example command-line using this technique is:

odbcmine -f golf2.svg miningdb golf play outlook windy "IIF(humidity>75,'humid','dry')"

The resulting decision tree looks like:

Also, to create any number of distinct discrete values from a continuous range, you can nest the IIF expressions:

IIF(humidity>85,'humid',IIF(humidity>60,'moderate','dry'))

Multiple Tables

If the data being analyzed spans multiple tables, you can create a relational view that performs a query to join the data into one result set. Note that in MS Access, this is equivalent to creating a named query; for other databases, you can use the SQL CREATE VIEW statement. You can then specify the name of this view as the table to analyze.

Viewing SVG Decision Trees

ODBCMine writes decision trees in SVG format. We recommend downloading and installing the Adobe SVG Viewer plug-in for viewing SVG files using your internet browser (Internet Explorer or Netscape). More SVG software is listed at W3C SVG Implementations. You can convert SVG into PNG using the Batik SVG Rasterizer (requires a Java runtime):

java -jar batik-rasterizer.jar golf.svg

A common problem occurs when the size of the SVG tree is bigger than the display window. The Adobe SVG viewer does not provide scrolling facilities; as a workaround, you can reference the SVG file from another HTML file and launch your browser on the HTML file. This can be done via the HTML embed tag:

<embed type="image/svg+xml" src="odbcmine.svg" width="2000" height="400">

Usage

ODBCMine provides a command-line interface that allows analysis of any table or view within an ODBC data source. The resulting decision tree is written as an SVG file. Errors are written to stderr.

The format of the command line is:

odbcmine dsn table goal column [...column]

where:

  • dsn is the ODBC data source name to connect to; this may be optionally quoted, as in: 'long data source name'
  • table is the name of a database table (or view) to analyze, containing the categorical and non-categorical attributes as columns
  • goal is the name of the column representing the goal of the decision. This must be a CHAR or VARCHAR column with discrete (non-continuous) values.
  • column is the name of a column that is a factor in the decision. This may be a column of any numeric type; character columns are assumed to contain distinct values (as in the ID3 algorithm), and all non-character columns are treated as having continuous ranges (as in the C4.5 extensions)

    ODBCMine also accepts a number of optional arguments:

    -f fnamewrite output to named file. If this argument is not supplied, the decision tree is written to odbcmine.svg
    -u uiduserid for ODBC datasource
    -p pwdpassword for ODBC datasource
    -ddebug mode; writes each SQL statement as it is executed to the console
    -rdisplay the decision tree as a set of IF-ELSE rules
    -vdraw tree in vertical orientation (default is horizontal)

    Any of the arguments may be quoted, as in:

    odbcmine -f "c:\\my directory\\output.svg" miningdb golf play humidity

    For example, to generate rules using the golfing example described in the article above, issue the following command at the command line (assuming that you have set up the sample miningdb.mdb database as an ODBC data source named miningdb). The output will be written to odbcmine.svg:

    odbcmine miningdb golf play outlook humidity windy

    Obtaining ODBCMine

    ODBCMine is a 32-bit Windows executable, suitable for use on Windows 98, Windows NT, Windows 2000, or Windows XP. The distribution file in .ZIP format is available at odbcmine.zip and contains the following files:

  • odbcmine.htm (this documentation file)
  • odbcmine.exe (ODBCMine executable)
  • odbcmine.err (error messages)
  • miningdb.mdb (MS Access 97 file containing the Golf and StockMarket tables described in the article)
  • golf.bat (sample command line to run the golfing decision example)
  • stock.bat (sample command line to run the stock market decision example)
  • stock2.bat (sample command line demonstrating technique for mapping a continuous column into a discrete column)
  • license.htm (license agreement)

    You can download the Adobe SVG Viewer here:

    Installation

    To install ODBCMine, unpack the distribution ZIP file into a directory on your hard drive (normally c:\ODBCMine), and place the name of the directory into your PATH variable.

    To run the sample cases in golf.bat and stock.bat, you must have the MS Access ODBC Driver (not included). Also, you must define an ODBC data source named miningdb and point it to the supplied miningdb.mdb file. You should refer to the ODBC documentation under the MS Windows control panel for more information about setting up ODBC data sources.

    Release Notes

  • Version 2.1, 25-April-2003: new option to print decision tree as if-then decision rules
  • Version 2.0, 21-March-2003: initial release of version 2.0

    We are planning future enhancements to ODBCMine to:

  • allow pruning of decision tree nodes;
  • apply decision trees to new ODBC data to predict values of the decision attribute;
  • save decision trees in XML format (PMML); and
  • provide a DLL version that can be called from other applications (and redistributed).

    Known Limitations

  • Testing has been done only under MS Access and the Microsoft Text ODBC Driver.
  • Full ODBC connection strings are not supported.
  • Your ODBC database driver must support the IIF() function in SQL, in order to analyze continuous (numeric-valued) columns. MS Access and Oracle support this; other databases may not. Consult your driver documentation.
  • Your ODBC database driver must support the SQLGetInfo() function, to obtain the identifier quote character (used when table or column names contain spaces). Consult your ODBC driver documentation.
  • The analysis does not work on columns containing binary, DATE, TIME, and TIMESTAMP data.

    Registration & Licensing

    ODBCMine is distributed as shareware. The evaluation version limits the number of records to be analyzed to 15, and the number of columns to 4. The registered version removes these limitations. You may try it freely for a period of 30 days, after which you must register it. The following commercial licenses are available:

    License Price Description
    ODBCMine Single-User License$49.95This license includes use on a single machine by a single user.

    Registration provides up to 1 hour of free technical support. Sorry, we cannot provide support to unregistered users. Contact us for site licenses and other support plans. Academic users may obtain and use the registered version for free via special request; please email us with your name, address, and course details, and we will provide download instructions.

    ODBCMine is Copyright © 2003 Intelligent Systems Research LLC. All Rights Reserved. Your use of the program indicates acceptance of the License Agreement with Intelligent Systems Research LLC.

    Questions? Comments? Contact us at info@intsysr.com