Data Mining: Next Generation Challenges and Future Directions
H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha
MIT/AAAI Press
Available from Barnes & Noble, Amazon, among others
Content:
Pervasive, Distributed, and Stream Data Mining
-
Existential Pleasures of Distributed Data Mining
H. Kargupta and K. Sivakumar
-
Research Issues in Mining and Monitoring of Intelligence Data
Alan Demers, Johannes Gehrke, and Mirek Riedewald
-
A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing
Joydeep Ghosh, Alexander Strehl, Srujana Merugu
-
Design of Distributed Data Mining Applications on the KNOWLEDGE GRID
Mario Cannataro, Domenico Talia, Paolo Trunfio
-
Photonic Data Services: Integrating Data, Network and Path Services to Support Next Generation Data Mining Applications
Robert L. Grossman, Yunhong Gu, Dave Hanley, Xinwei Hong, Jorge Levera and Marco Mazzucco, Dave Lillethun, Joe Mambretti, and Jeremy Weinberger
-
Mining Frequent Patterns in Data Streams at Multiple Time Granularities
Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu
-
Efficient Data-Reduction Methods for On-Line Association Rule Discovery
Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter Scheuermann
-
Discovering Recurrent Events in Multichannel Data Streams Using Unsupervised Methods
Milind R. Naphade, Chung-Sheng Li, Thomas S. Huang
Counter-Terrorism, Privacy, and Data Mining
-
Data Mining for Counter-Terrorism
Bhavani Thuraisingham
-
Biosurveillance and Outbreak Detection
Paola Sebastiani, Kenneth D. Mandl
-
MINDS - Minnesota Intrusion Detection System
Levent Ertoz, Eric Eilertson, Aleksandar Lazarevic, Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava, Paul Dokas
-
Marshalling Evidence Through Data Mining in Support of Counter Terrorism
Daniel Barbara, James J. Nolan, David Schum, Arun Sood
-
Relational Data Mining with Inductive Logic Programming for Link Discovery
Raymond J. Mooney, Prem Melville, Lappoon Rupert Tang, Jude Shavlik, Ines de Castro Dutra, David Page, Vítor Santos Costa
-
Defining Privacy for Data Mining
Chris Clifton, Murat Kantarcoglu, Jaideep Vaidya
Scientific Data Mining
-
Mining Temporally-Varying Phenomena in Scientific Datasets
R. Machiraju, S. Parthasarathy, J. Wilkins, D. Thompson, B. Gatlin, D. Richie, T. Choy, M. Jiang, S. Mehta, M. Coatney, S. Barr, K. Hazzard
-
Methods for Mining Protein Contact Maps
Mohammed J. Zaki, Jingjing Hu, and Chris Bystroff
-
Mining Scientific Data Sets using Graphs
Michihiro Kuramochi, Mukund Deshpande, and George Karypis
-
Challenges in Environmental Data Warehousing and Mining
Nabil R. Adam, Vijayalakshmi Atluri, Dihua Guo, Songmei Yu
-
Trends in Spatial Data Mining
Shashi Shekhar, Pusheng Zhang, Yan Huang, Ranga Raju Vatsavai
-
Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples
Zoran Obradovic and Slobodan Vucetic
Web, Semantics and Data Mining
-
Web Mining - Concepts, Applications & Research Directions
Jaideep Srivastava, Prasanna Desikan, Vipin Kumar
-
Advancements in Text Mining Algorithms and Software
Svetlana Y. Mironova, Michael W. Berry, Scott Atchley, Micah Beck, Tianhao Wu,
Lars E. Holzman, William M. Pottenger, Daniel J. Phelps
-
On Data Mining, Semantics, and Intrusion Detection: What to dig for and Where to find it
Anupam Joshi and Jeffrey Undercoffer
-
Usage Mining for and on the Semantic Web
Bettina Berendt, Gerd Stumme, and Andreas Hotho
==================================================
Content (With Abstracts):
Pervasive, Distributed, and Stream Data Mining
Existential Pleasures of Distributed Data Mining
H. Kargupta and K. Sivakumar
Abstract: This paper presents an overview of the field of distributed data mining (DDM).
It particularly focuses on the DDM algorithms that work in loosely coupled distributed
environments. It discusses several emerging DDM application areas, such as mining over
wireless networks, large-scale distributed scientific data mining, and privacy sensitive
multi-party data mining. The paper reviews a wide range of DDM algorithms for these
application areas and points out some of the future directions of the field. Keywords:
Distributed data mining, mobile data mining, privacy-sensitive multi-party data mining
Research Issues in Mining and Monitoring of Intelligence Data
Alan Demers, Johannes Gehrke, and Mirek Riedewald
Abstract:
In this paper we describe research problems which need to be addressed to enable
efficient and effective mining and monitoring of intelligence data. We first review
the basic architecture of such a system, and then outline open research problems in
areas such as multi-query optimization, online data mining, high-speed archiving,
foundations of stream computations, distributed data mining, and data mining model
management. We also present recent results which we extend in our ongoing research
towards the goal of building a distributed mining and monitoring application.
A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing
Joydeep Ghosh, Alexander Strehl, Srujana Merugu
Abstract:
This article provides an illustration of next-generation privacy-preserving data mining
by exploring the problem of combining multiple partitionings of a set of objects
into a single consolidated clustering without accessing the features or algorithms that
determined these partitionings. This problem is an abstraction of scenarios where different
organizations have grouped some or all elements of a common underlying population,
possibly using different features, algorithms or clustering criteria. Moreover,
due to real life constraints such as proprietary techniques, legal restrictions, different
data ownerships, etc., it is not feasible to pool all the data into a central location and
then apply clustering techniques: the only information that can be shared are the symbolic
cluster labels. The cluster ensemble problem is formalized as a combinatorial
optimization problem that obtains a consensus function in terms of shared mutual information
among individual solutions. Three effective and efficient techniques for obtaining
high-quality consensus functions are described and studied empirically for the
following qualitatively different application scenarios: (i) where the original clusters
were formed based on non-identical sets of features, (ii) where the original clustering
algorithms were applied to non-identical sets of objects and (iii) when the individual
solutions provide varying numbers of clusters. Promising results are obtained in all
the three situations for synthetic as well as real data sets, even under severe restrictions
on data and knowledge sharing. These results motivate the further exploration of
consensus techniques for a wider class of distributed data mining applications.
Keywords: distributed clustering, privacy-preserving data mining, cluster ensembles,
knowledge reuse, consensus functions, mutual information, distributed data mining.
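The idea of scoring a candidate consensus by the mutual information it shares with the individual clusterings can be illustrated with a minimal Python sketch. It is not one of the three techniques the chapter describes: as a simplifying assumption, it only picks the existing labeling with the highest average normalized mutual information (NMI) against all inputs, and the function names are hypothetical. Only cluster labels are used, never the underlying features.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (in nats) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two labelings of the same objects."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ab / n) * log(n_ab * n / (count_a[x] * count_b[y]))
        for (x, y), n_ab in joint.items()
    )


def nmi(a, b):
    """Mutual information normalized by the geometric mean of the entropies."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 or h_b == 0:
        return 0.0  # a single-cluster labeling shares no information
    return mutual_information(a, b) / sqrt(h_a * h_b)


def average_nmi(candidate, labelings):
    """Average NMI between a candidate labeling and every input labeling."""
    return sum(nmi(candidate, other) for other in labelings) / len(labelings)


def best_existing_labeling(labelings):
    """Simplified consensus (an assumption): return the input labeling
    that shares the most information with all the inputs."""
    return max(labelings, key=lambda c: average_nmi(c, labelings))
```

Because NMI depends only on how objects are grouped, two organizations that use different label names for the same grouping score as identical.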
Design of Distributed Data Mining Applications on the KNOWLEDGE GRID
Mario Cannataro, Domenico Talia, Paolo Trunfio
Abstract:
Many industrial, scientific, and commercial applications need to analyze large data sets
maintained over geographically distributed sites. The geographic distribution and the
large amount of data involved often oblige designers to use distributed and parallel
systems. The Grid can play a significant role in providing an effective computational
support for distributed data mining and knowledge discovery applications. This paper
introduces the KNOWLEDGE GRID, a Grid-based software system for geographically
distributed knowledge discovery applications. The main system components are
described and the design and implementation steps of distributed data mining applications
on Grids using these components are discussed.
Photonic Data Services: Integrating Data, Network and Path Services to Support Next Generation Data Mining Applications
Robert L. Grossman, Yunhong Gu, Dave Hanley, Xinwei Hong, Jorge Levera and Marco Mazzucco, Dave Lillethun, Joe Mambretti, and Jeremy Weinberger
Abstract:
We describe an architecture for next generation, distributed
data mining systems which integrates data services
to facilitate remote data analysis and distributed data mining,
network protocol services for high performance data
transport, and path services for optical paths. We also
present experimental evidence using geoscience data that
this architecture scales the remote analysis of Gigabyte size
data sets over long haul, high performance networks.
Mining Frequent Patterns in Data Streams at Multiple Time Granularities
Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu
Abstract:
Although frequent-pattern mining has been widely studied and used, it is challenging
to extend it to data streams. Compared with mining a static transaction data set, the
streaming case has far more information to track and far greater complexity to manage.
Infrequent items can become frequent later on and hence cannot be ignored. The
storage structure needs to be dynamically adjusted to reflect the evolution of itemset frequencies
over time.
In this paper, we propose an approach based on computing and maintaining all the
frequent patterns (which are usually more stable and smaller than the streaming data) and
dynamically updating them with the incoming data stream. We extend the framework
to mine time-sensitive patterns with approximate support guarantees. We incrementally
maintain tilted-time windows for each pattern at multiple time granularities. Interesting
queries can be constructed and answered under this framework.
Moreover, inspired by the fact that the FP-tree provides an effective data structure
for frequent pattern mining, we develop FP-stream, an FP-tree-based data structure
for maintaining time sensitive frequency information about patterns in data streams.
The FP-stream can be scanned to mine frequent patterns over multiple time granularities.
An FP-stream structure consists of (a) an in-memory frequent pattern-tree to
capture the frequent and sub-frequent itemset information, and (b) a tilted-time window
table for each frequent pattern. Efficient algorithms for constructing, maintaining
and updating an FP-stream structure over data streams are explored. Our analysis and
experiments show that it is realistic to maintain an FP-stream in data stream environments
even with limited main memory.
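A tilted-time window keeps fine-grained counts for recent batches and progressively coarser counts for older ones, so memory grows only logarithmically with the length of the stream. The Python sketch below is a simplified illustration for a single pattern, not the chapter's FP-stream implementation: the class name, the one-buffer-slot-per-level layout, and the merge rule are assumptions made for the example.

```python
class TiltedTimeWindow:
    """Simplified logarithmic tilted-time window for one pattern.

    Level i holds counts that each cover 2**i batches: main[i] is the
    newer slot and buffer[i] is an older, intermediate slot.
    """

    def __init__(self):
        self.main = []
        self.buffer = []

    def add(self, count):
        """Record the pattern's count for the newest batch."""
        carry, level = count, 0
        while carry is not None:
            if level == len(self.main):
                self.main.append(None)
                self.buffer.append(None)
            old = self.main[level]
            self.main[level] = carry
            if old is None:
                carry = None  # the level was empty; nothing to shift
            elif self.buffer[level] is None:
                self.buffer[level] = old  # park the older count
                carry = None
            else:
                # Both slots are occupied: merge the two older counts
                # and push the sum down to the next, coarser level.
                carry = old + self.buffer[level]
                self.buffer[level] = None
            level += 1

    def total(self):
        """Total count over every batch seen so far."""
        slots = self.main + self.buffer
        return sum(c for c in slots if c is not None)
```

Merging never discards counts, so the total over the whole stream stays exact; only the time resolution of older batches is reduced.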
Efficient Data-Reduction Methods for On-Line Association Rule Discovery
Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter Scheuermann
Abstract:
Classical data mining algorithms require one or more computationally intensive passes
over the entire database and can be prohibitively slow. One effective method for dealing
with this ever-worsening scalability problem is to run the algorithms on a small
sample of the data. We present and empirically compare two data-reduction algorithms
for producing such a sample; these algorithms, called FAST and EA, are tailored to
count data applications such as association-rule mining. The algorithms are similar
in that both attempt to produce a sample whose distance (appropriately defined)
from the complete database is minimal. They differ greatly, however, in the way they
greedily search through the exponential number of possible samples. FAST, recently
developed by Chen et al., uses random sampling together with trimming of outlier
transactions. On the other hand, the EA algorithm, introduced in this chapter, repeatedly
and deterministically halves the data to obtain the final sample. Unlike FAST,
the EA algorithm provides a guaranteed level of accuracy. Our experiments show that
EA is more expensive to run than FAST, but yields more accurate results for a given
sample size. Thus, depending on the specific problem under consideration, the user
can trade off speed and accuracy by selecting the appropriate method. We conclude
by showing how the EA data-reduction approach can potentially be adapted to provide
data-reduction schemes for streaming data systems. The proposed schemes favor recent
data while still retaining partial information about all of the data seen so far.
Discovering Recurrent Events in Multichannel Data Streams Using Unsupervised Methods
Milind R. Naphade, Chung-Sheng Li, Thomas S. Huang
Abstract:
Detection of recurrent temporal patterns in digital media is a first step in the
next generation data mining for media content. Production videos such as
news, sports and movies have a definitive structure that involves short term
interaction as well as long term correlation. This structure in video can be
captured by models that take into consideration the short term statistics as well
as long term recurrence. We investigate the application of probabilistic models
that capture this structure. The novel approach is to characterize the short
term events in video by models that can account for temporal support in terms
of piece-wise stationary signals with transitions. These short term events can
then be embedded within another temporal model that accounts for transitions
between these events and thus characterizes long term history. This also leads
to the detection of recurring events in video using a monolithic model. The
proposed approach is an unsupervised algorithm for event detection and it can
be used for summarization, similarity based matching and enhanced browsing.
With certain extensions similar algorithms can be used for data mining for
temporal patterns in other domains such as bio-surveillance, where the multimodal
sensor streams may be comprised of traditional and non-traditional data
sources.
Counter-Terrorism, Privacy, and Data Mining
Data Mining for Counter-Terrorism
Bhavani Thuraisingham
Abstract:
Data mining is becoming a useful tool for detecting and preventing terrorism. This
paper first discusses some technical challenges for data mining as applied to counter-terrorism
applications. Next it provides an overview of the various types of terrorist
threats and describes how data mining techniques could provide solutions to counter-terrorism.
Finally, some privacy concerns, and potential solutions that could detect terrorist
activities while still attempting to maintain privacy, are discussed.
Biosurveillance and Outbreak Detection
Paola Sebastiani, Kenneth D. Mandl
Abstract:
Faced with the very real threat of bioterrorism, the critical need for early detection of
an outbreak has shortened the time frame for major enhancements to our public health
infrastructure. The early detection of covert biological attacks requires real time data
streams revealing the health of the population, as well as novel methods to detect
abnormalities.
MINDS - Minnesota Intrusion Detection System
Levent Ertoz, Eric Eilertson, Aleksandar Lazarevic, Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava, Paul Dokas
Abstract:
This paper introduces the Minnesota Intrusion Detection System (MINDS), which uses
a suite of data mining techniques to automatically detect attacks against computer networks
and systems. While the long-term objective of MINDS is to address all aspects
of intrusion detection, this paper focuses on two specific contributions: (i) an unsupervised
anomaly detection technique that assigns a score to each network connection that
reflects how anomalous the connection is, and (ii) an association pattern analysis based
module that summarizes those network connections that are ranked highly anomalous
by the anomaly detection module. Experimental results on live network traffic at the
University of Minnesota show that our anomaly detection techniques are very promising
and are successful in automatically detecting several novel intrusions that could
not be identified using popular signature-based tools such as SNORT. Furthermore,
given the very high volume of connections observed per unit time, association pattern
based summarization of novel attacks is quite useful in enabling a security analyst to
understand and characterize emerging threats.
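Assigning each connection a score that reflects how anomalous it is can be illustrated with a short Python sketch. The scoring rule here, the mean distance to the k nearest other connections, is a stand-in chosen for brevity and is an assumption, not the technique used in MINDS; the function name and the numeric feature vectors are hypothetical as well.

```python
from math import dist


def knn_anomaly_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours.

    points: list of equal-length numeric feature tuples, one per
    connection; assumes len(points) > k. Higher scores mean the point
    lies farther from the rest of the data.
    """
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores
```

Sorting connections by descending score gives the ranked list that an analyst, or a summarization module, would then work through.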
Marshalling Evidence Through Data Mining in Support of Counter Terrorism
Daniel Barbara, James J. Nolan, David Schum, Arun Sood
Abstract:
In this chapter we present an architecture to manage a large, distributed volume of evidence
for counterterrorism applications. This approach facilitates the intelligence analyst’s
train of thought by enabling hypotheses to be postulated and evaluated against the
available evidence. The hypothesis creation is left to the human (as we believe only humans
possess the flexibility to adapt hypotheses to the dynamic nature of the problem),
but the system automates the gathering and linking of evidence that supports or negates
the hypothesis. We do this by extensive use of data mining techniques in support of
the process of query answering. The approach is incorporated into a distributed agent
system that allows users to discover and compose agents for hypothesis processing.
Relational Data Mining with Inductive Logic Programming for Link Discovery
Raymond J. Mooney, Prem Melville, Lappoon Rupert Tang, Jude Shavlik, Ines de Castro Dutra, David Page, Vítor Santos Costa
Abstract:
Link discovery (LD) is an important task in data mining for counter-terrorism and is
the focus of DARPA’s Evidence Extraction and Link Discovery (EELD) research program.
Link discovery concerns the identification of complex relational patterns that
indicate potentially threatening activities in large amounts of relational data. Most
data-mining methods assume data is in the form of a feature-vector (a single relational
table) and cannot handle multi-relational data. Inductive logic programming is a form
of relational data mining that discovers rules in first-order logic from multi-relational
data. This paper discusses the application of ILP to learning patterns for link discovery.
Defining Privacy for Data Mining
Chris Clifton, Murat Kantarcoglu, Jaideep Vaidya
Abstract:
Privacy preserving data mining (getting valid data mining results without learning the
underlying data values) has been receiving attention in the research community and
beyond. It is unclear what privacy preserving means. This paper provides a framework
and metrics for discussing the meaning of privacy preserving data mining, as a
foundation for further research in this field.
Scientific Data Mining
Mining Temporally-Varying Phenomena in Scientific Datasets
R. Machiraju, S. Parthasarathy, J. Wilkins, D. Thompson, B. Gatlin, D. Richie, T. Choy, M. Jiang, S. Mehta, M. Coatney, S. Barr, K. Hazzard
Abstract:
Simulation is enhancing and, in many instances, replacing experimentation as a means
to gain insight into complex physical phenomena. Recent advances in computer hardware
and numerical methods have made it possible to simulate physical phenomena at
very fine temporal and spatial resolutions. Unfortunately, given the enormous sizes of
the datasets involved, analyzing datasets produced by these simulations is extremely
challenging. In order to more fully exploit simulation, the analysis of these large
datasets must advance beyond current techniques that are based on interactive visualization.
We outline our vision for one such approach and describe progress on a unified
framework that promises to provide a novel method to explore large simulation data
sets. We illustrate its application to two disparate science drivers – temporally varying
solid and fluid systems. In both applications, there are hidden hierarchies of features
as well as many abstract multidimensional feature characterizations (e.g. shapes).
Through this framework, we offer a systematic approach to detect, characterize, and
track meta-stable features as well as formulate hypotheses about their evolution – an
important step in extracting vital information from such complex systems.
Methods for Mining Protein Contact Maps
Mohammed J. Zaki, Jingjing Hu, and Chris Bystroff
Abstract:
The 3D conformation of a protein may be compactly represented in a symmetrical,
square, boolean matrix of pairwise, inter-residue contacts, or contact map. The contact
map provides a host of useful information about the protein’s structure. In this
paper we describe how data mining can be used to extract valuable information from
contact maps. For example, clusters of contacts represent certain secondary structures,
and also capture non-local interactions, giving clues to the tertiary structure. We show
that using contact maps and a hybrid mining approach, we can construct contact rules
to predict the structure of an unknown protein. Furthermore, we mine non-local frequent
dense contact patterns that discriminate physical from non-physical maps. We
cluster these patterns based on their similarities and evaluate the clustering quality, and
show that our techniques are effective in characterizing contact patterns across different
proteins.
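The contact map itself is easy to build from residue coordinates. The following Python sketch is an illustration only: the 7.0 distance cutoff, the use of one coordinate per residue, and the choice to leave the diagonal empty are assumptions for the example, not values taken from the chapter.

```python
from math import dist


def contact_map(coords, threshold=7.0):
    """Build a symmetric boolean contact map.

    coords: one (x, y, z) tuple per residue.
    threshold: assumed distance cutoff below which two residues count
    as being in contact. A residue is not counted as being in contact
    with itself.
    """
    n = len(coords)
    return [
        [i != j and dist(coords[i], coords[j]) <= threshold for j in range(n)]
        for i in range(n)
    ]
```

For example, `contact_map([(0, 0, 0), (3.8, 0, 0), (20, 0, 0)])` marks the first two residues as being in contact and leaves the third isolated.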
Mining Scientific Data Sets using Graphs
Michihiro Kuramochi, Mukund Deshpande, and George Karypis
Abstract:
As data mining techniques are being increasingly applied to scientific and other nontraditional
domains, it is becoming apparent that existing approaches for modeling the
characteristics of the underlying physical phenomena and processes are limiting as they
cannot model the requirements of these domains. An alternate way is to use graphs to
capture in a single and unified framework many of the spatial, topological, geometric,
and other types of relational characteristics present in such datasets. The added power
provided by graph-modeling can only be realized if computationally efficient and scalable
algorithms for many of the classical data-mining tasks, such as frequent pattern
discovery, clustering, and classification, become available. In this chapter we present
some of our research on developing algorithms for finding frequently occurring patterns
and building predictive models for graph datasets.
Challenges in Environmental Data Warehousing and Mining
Nabil R. Adam, Vijayalakshmi Atluri, Dihua Guo, Songmei Yu
Abstract: In the area of Environmental and Earth sciences, we are concerned with
the collection, assimilation, cataloging and dissemination or retrieval of a vast array of
environmental data, which include satellite imagery of different resolutions captured
by different sensors, aerial photos, radar and other monitoring networks, models of the
topography, spatial attributes of the landscape such as roads, rivers, parcels, schools,
zip code areas, city streets and administrative boundaries, census information, digital
terrain models, etc. In this paper, we describe the complex nature of the queries posed
to such databases and argue that data warehousing and data mining techniques must be
employed to efficiently process these queries. We then identify the research challenges
in environmental data warehousing and mining. In particular, we identify research
related to data model and OLAP operations, as well as the challenges in extending
traditional data mining techniques to mine environmental data.
Trends in Spatial Data Mining
Shashi Shekhar, Pusheng Zhang, Yan Huang, Ranga Raju Vatsavai
Abstract:
Spatial data mining is the process of discovering interesting and
previously unknown, but potentially useful patterns from large
spatial datasets. Extracting interesting and useful patterns from
spatial datasets is more difficult than extracting the
corresponding patterns from traditional numeric and categorical
data due to the complexity of spatial data types, spatial
relationships, and spatial autocorrelation. This chapter
focuses on the unique features that distinguish spatial data mining
from classical data mining. Major accomplishments and research needs
in spatial data mining research are discussed.
Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples
Zoran Obradovic and Slobodan Vucetic
Abstract:
With recent innovations in data collection technologies in many emerging scientific
applications there is a need for a paradigm shift from a traditional hypothesize-and-test
process to a partial automation of hypothesis generation, model construction and
experimentation. To develop appropriate knowledge discovery tools in support of the
paradigm shift in the sciences, numerous data mining challenges must be solved. The
next generation data mining challenges considered in this article include (1) mining of
heterogeneous data; (2) learning from biased labeled samples; (3) enhancing predictive
modeling by exploiting large unlabeled databases; and (4) reducing large datasets
with a controllable information loss for a more efficient mining in a distributed environment.
Several effective methods for addressing these problems developed at our
laboratory are described. Applications are illustrated on six scientific and engineering
problems in precision agriculture, electric power systems and biochemistry studied in
our collaboration with domain experts. It is demonstrated that some fairly general data
mining algorithms with appropriate modifications can be adopted for a more successful
mining of spatial observations, nonstationary time series and heterogeneous sequence
databases in specific scientific applications.
Web, Semantics and Data Mining
Web Mining - Concepts, Applications & Research Directions
Jaideep Srivastava, Prasanna Desikan, Vipin Kumar
Abstract:
From its very beginning, the potential of extracting valuable knowledge from the Web
has been quite evident. Web mining, i.e. the application of data mining techniques to
extract knowledge from Web content, structure, and usage, is the collection of technologies
to fulfill this potential. Interest in Web mining has grown rapidly in its short
history, both amongst researchers and practitioners. This chapter provides a brief
overview of the accomplishments of the field, both in terms of technologies and applications,
and outlines key future research directions.
Advancements in Text Mining Algorithms and Software
Svetlana Y. Mironova, Michael W. Berry, Scott Atchley, Micah Beck, Tianhao Wu,
Lars E. Holzman, William M. Pottenger, Daniel J. Phelps
Abstract:
In this chapter, we present two advancements in the development of algorithms and
software for the mining of textual information. For large-scale indexing needs, we
present the General Text Parser (GTP) software environment with network storage capability.
This object-oriented software (C++, Java) is designed to provide information
retrieval (IR) and data mining specialists the ability to parse and index large text collections.
GTP utilizes Latent Semantic Indexing (or LSI) for the construction of a vector
space IR model. Users can choose to store the files generated by GTP on a remote
network in order to overcome local storage restrictions and facilitate file sharing. For
feature extraction from textual information, we present a supervised learning algorithm
for the discovery of finite state automata, in the form of regular expressions, in textual
data. The automata generate languages that consist of various representations of features
useful in information extraction. This learning technique has been used in the
extraction of textual features from police incident reports and patent data.
On Data Mining, Semantics, and Intrusion Detection: What to dig for and Where to find it
Anupam Joshi and Jeffrey L. Undercoffer
Abstract:
We examine the intersection of data mining with the semantic web, in particular
with ontologies and reasoning. We show how the two can operate in synergy to
better distributed data mining, in particular Distributed Intrusion Detection. We have
developed a target-centric ontological framework that models computer attacks. Our
ontology is based upon an analysis of over 4,000 classes of computer attacks and their
corresponding attack strategies. We are using data mining techniques to identify and
specify instances of the ontology. We illustrate the utility of an ontology with an example
of the Mitnick Attack, an attack that combines a denial of service and TCP hijacking
in a distributed manner. We then define an Intrusion Detection System model that
employs data mining techniques for initial hypothesis testing and then uses semantic
reasoning to further test events that are initially identified as falling outside the bounds
of normal system activity.
Usage Mining for and on the Semantic Web
Bettina Berendt, Gerd Stumme, and Andreas Hotho
Abstract:
Semantic Web Mining aims at combining the two fast-developing research areas Semantic Web
and Web mining. Web mining aims at discovering insights about the meaning
of Web resources and their usage. Given the primarily syntactical nature of the data
Web mining operates on, the discovery of meaning is impossible based on these data
only. Therefore, formalizations of the semantics of Web resources and navigation behavior
are increasingly being used. This fits exactly with the aims of the Semantic Web:
the Semantic Web enriches the WWW by machine-processable information which supports
the user in his tasks. In this paper, we discuss the interplay of the Semantic Web
with Web mining, with a specific focus on usage mining.