A General Framework for Mining Massive Data Streams


5 pages



Publisher
Taylor & Francis
Copyright
© American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America
ISSN
1537-2715
eISSN
1061-8600
DOI
10.1198/1061860032544

Abstract

In many domains, data now arrive faster than we are able to mine them. To avoid wasting these data, we must switch from the traditional “one-shot” data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this article we identify some desiderata for such systems, and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream while guaranteeing (as long as the data are i.i.d.) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.
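The finite-sample guarantee described in the abstract is typically obtained via the Hoeffding bound (one of the article's listed keywords): after n i.i.d. observations of a statistic with range R, the true mean lies within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the observed mean with probability at least 1 - delta. A minimal sketch of the resulting stopping rule follows; the function names and the use of a split-gain gap are illustrative assumptions, not code from the paper:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability at least 1 - delta, the true mean of a statistic
    with range `value_range` is within this epsilon of the mean observed
    over n i.i.d. examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def enough_examples(gain_best, gain_second, value_range, delta, n):
    """Decision-tree-style stopping rule: commit to the best candidate
    split once the observed gap over the runner-up exceeds the bound,
    i.e. once more data could no longer plausibly reverse the ranking."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)
```

For example, with a gain gap of 0.2, range 1.0, and delta = 1e-6, roughly a thousand examples suffice, whereas ten do not; this is how such methods read only as much of the stream as the desired guarantee requires.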

Journal

Journal of Computational and Graphical Statistics, Taylor & Francis

Published: Dec 1, 2003

Keywords: Data mining; Hoeffding bounds; Machine learning; Scalability; Subsampling
