"Scalable Data Mining with Model Constraints"

by Minos Garofalakis and Rajeev Rastogi.
SIGKDD Explorations, Vol. 2, No. 2, December 2000 (Special Issue on Scalable Data Mining), pp. 39-48.


Data mining can be abstractly defined as the process of extracting concise and interesting models (or, patterns) from large amounts of data. Unfortunately, conventional mining systems provide users with only very restricted mechanisms for specifying models of interest. As a consequence, the mining process is typically characterized by lack of focus and users often end up paying computational costs that are inordinately high compared to the specific models/patterns of interest. Exploiting user-defined model constraints during the mining process can help alleviate this problem and ensure system performance that is commensurate with the level of user focus. Attaining such performance goals, however, is not straightforward and, typically, requires the design of novel data mining algorithms that make effective use of the model constraints. In this paper, we provide an overview of our recent work on scalable, constraint-based algorithms for (1) decision tree construction with size and accuracy constraints for the desired decision tree model, and (2) sequential pattern extraction in the presence of structural, regular expression constraints for the target patterns. By "pushing" the model constraints inside the mining process, our algorithms give mining users exactly the models that they are looking for, while achieving performance speedups that often exceed one order of magnitude. Further, our work on sequential pattern mining has uncovered some valuable insights into the tradeoffs that arise when complex constraints that do not subscribe to "nice" properties (like anti-monotonicity) are integrated into the mining process. We argue that, in general, a cost-based approach (similar to that employed in conventional query optimizers) is needed to explore these tradeoffs in a principled manner and produce effective execution plans for ad-hoc mining queries.

[ paper (pdf) (ps.gz) ]

Copyright © 2000, Association for Computing Machinery, Inc. (ACM). Permission to make digital/hard copy of all or part of this material without fee is granted provided that copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery, Inc. (ACM). To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.