UMBC Dr. Hillol Kargupta

Books
	New Book: Data Mining: Next Generation Challenges and Future Directions. H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha. MIT/AAAI Press. Available from Barnes & Noble, Amazon, among others.
	Advances in Distributed and Parallel Knowledge Discovery. Edited by Hillol Kargupta and Philip Chan. MIT/AAAI Press.

Data Mining: Next Generation Challenges and Future Directions

H. Kargupta, A. Joshi, K. Sivakumar, and Y. Yesha
MIT/AAAI Press

Available from Barnes & Noble, Amazon, among others

Content:

Pervasive, Distributed, and Stream Data Mining

Existential Pleasures of Distributed Data Mining
H. Kargupta and K. Sivakumar

Research Issues in Mining and Monitoring of Intelligence Data
Alan Demers, Johannes Gehrke, and Mirek Riedewald

A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing
Joydeep Ghosh, Alexander Strehl, Srujana Merugu

Design of Distributed Data Mining Applications on the KNOWLEDGE GRID
Mario Cannataro, Domenico Talia, Paolo Trunfio

Photonic Data Services: Integrating Data, Network and Path Services to Support Next Generation Data Mining Applications
Robert L. Grossman, Yunhong Gu, Dave Hanley, Xinwei Hong, Jorge Levera, Marco Mazzucco, Dave Lillethun, Joe Mambretti, and Jeremy Weinberger

Mining Frequent Patterns in Data Streams at Multiple Time Granularities
Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu

Efficient Data-Reduction Methods for On-Line Association Rule Discovery
Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter Scheuermann

Discovering Recurrent Events in Multichannel Data Streams Using Unsupervised Methods
Milind R. Naphade, Chung-Sheng Li, Thomas S. Huang

Counter-Terrorism, Privacy, and Data Mining

Data Mining for Counter-Terrorism
Bhavani Thuraisingham

Biosurveillance and Outbreak Detection
Paola Sebastiani, Kenneth D. Mandl

MINDS - Minnesota Intrusion Detection System
Levent Ertoz, Eric Eilertson, Aleksandar Lazarevic, Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava, Paul Dokas

Marshalling Evidence Through Data Mining in Support of Counter Terrorism
Daniel Barbara, James J. Nolan, David Schum, Arun Sood

Relational Data Mining with Inductive Logic Programming for Link Discovery
Raymond J. Mooney, Prem Melville, Lappoon Rupert Tang, Jude Shavlik, Ines de Castro Dutra, David Page, Vítor Santos Costa

Defining Privacy for Data Mining
Chris Clifton, Murat Kantarcoglu, Jaideep Vaidya

Scientific Data Mining

Mining Temporally-Varying Phenomena in Scientific Datasets
R. Machiraju, S. Parthasarathy, J. Wilkins, D. Thompson, B. Gatlin, D. Richie, T. Choy, M. Jiang, S. Mehta, M. Coatney, S. Barr, K. Hazzard

Methods for Mining Protein Contact Maps
Mohammed J. Zaki, Jingjing Hu, and Chris Bystroff

Mining Scientific Data Sets using Graphs
Michihiro Kuramochi, Mukund Deshpande, and George Karypis

Challenges in Environmental Data Warehousing and Mining
Nabil R. Adam, Vijayalakshmi Atluri, Dihua Guo, Songmei Yu

Trends in Spatial Data Mining
Shashi Shekhar, Pusheng Zhang, Yan Huang, Ranga Raju Vatsavai

Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples
Zoran Obradovic and Slobodan Vucetic

Web, Semantics and Data Mining

Web Mining - Concepts, Applications & Research Directions
Jaideep Srivastava, Prasanna Desikan, Vipin Kumar

Advancements in Text Mining Algorithms and Software
Svetlana Y. Mironova, Michael W. Berry, Scott Atchley, Micah Beck, Tianhao Wu, Lars E. Holzman, William M. Pottenger, Daniel J. Phelps

On Data Mining, Semantics, and Intrusion Detection: What to dig for and Where to find it
Anupam Joshi and Jeffrey Undercoffer

Usage Mining for and on the Semantic Web
Bettina Berendt, Gerd Stumme, and Andreas Hotho

==================================================

Content (With Abstracts):

Pervasive, Distributed, and Stream Data Mining

Existential Pleasures of Distributed Data Mining
H. Kargupta and K. Sivakumar
Abstract: This paper presents an overview of the field of distributed data mining (DDM). It focuses in particular on DDM algorithms that work in loosely coupled distributed environments. It discusses several emerging DDM application areas, such as mining over wireless networks, large-scale distributed scientific data mining, and privacy-sensitive multi-party data mining. The paper reviews a wide range of DDM algorithms for these application areas and points out some of the future directions of the field.
Keywords: distributed data mining, mobile data mining, privacy-sensitive multi-party data mining

Research Issues in Mining and Monitoring of Intelligence Data
Alan Demers, Johannes Gehrke, and Mirek Riedewald
Abstract: In this paper we describe research problems which need to be addressed to enable efficient and effective mining and monitoring of intelligence data. We first review the basic architecture of such a system, and then outline open research problems in areas such as multi-query optimization, online data mining, high-speed archiving, foundations of stream computations, distributed data mining, and data mining model management. We also present recent results which we extend in our ongoing research towards the goal of building a distributed mining and monitoring application.

A Consensus Framework for Integrating Distributed Clusterings Under Limited Knowledge Sharing
Joydeep Ghosh, Alexander Strehl, Srujana Merugu
Abstract: This article provides an illustration of next-generation privacy-preserving data mining by exploring the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. This problem is an abstraction of scenarios where different organizations have grouped some or all elements of a common underlying population, possibly using different features, algorithms, or clustering criteria. Moreover, due to real-life constraints such as proprietary techniques, legal restrictions, and different data ownerships, it is not feasible to pool all the data into a central location and then apply clustering techniques: the only information that can be shared is the symbolic cluster labels. The cluster ensemble problem is formalized as a combinatorial optimization problem that obtains a consensus function in terms of shared mutual information among individual solutions. Three effective and efficient techniques for obtaining high-quality consensus functions are described and studied empirically for the following qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms were applied to non-identical sets of objects, and (iii) where the individual solutions provide varying numbers of clusters. Promising results are obtained in all three situations for synthetic as well as real data sets, even under severe restrictions on data and knowledge sharing. These results motivate the further exploration of consensus techniques for a wider class of distributed data mining applications.
Keywords: distributed clustering, privacy-preserving data mining, cluster ensembles, knowledge reuse, consensus functions, mutual information, distributed data mining

Design of Distributed Data Mining Applications on the KNOWLEDGE GRID
Mario Cannataro, Domenico Talia, Paolo Trunfio
Abstract: Many industrial, scientific, and commercial applications need to analyze large data sets maintained over geographically distributed sites. The geographic distribution and the large amount of data involved often oblige designers to use distributed and parallel systems. The Grid can play a significant role in providing an effective computational support for distributed data mining and knowledge discovery applications. This paper introduces the KNOWLEDGE GRID, a Grid-based software system for geographically distributed knowledge discovery applications. The main system components are described and the design and implementation steps of distributed data mining applications on Grids using these components are discussed.

Photonic Data Services: Integrating Data, Network and Path Services to Support Next Generation Data Mining Applications
Robert L. Grossman, Yunhong Gu, Dave Hanley, Xinwei Hong, Jorge Levera, Marco Mazzucco, Dave Lillethun, Joe Mambretti, and Jeremy Weinberger
Abstract: We describe an architecture for next-generation distributed data mining systems that integrates data services to facilitate remote data analysis and distributed data mining, network protocol services for high-performance data transport, and path services for optical paths. We also present experimental evidence using geoscience data that this architecture scales the remote analysis of gigabyte-size data sets over long-haul, high-performance networks.

Mining Frequent Patterns in Data Streams at Multiple Time Granularities
Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu
Abstract: Although frequent-pattern mining has been widely studied and used, it is challenging to extend it to data streams. Compared with mining a static transaction data set, the streaming case has far more information to track and far greater complexity to manage. Infrequent items can become frequent later on and hence cannot be ignored. The storage structure needs to be dynamically adjusted to reflect the evolution of itemset frequencies over time. In this paper, we propose an approach based on computing and maintaining all the frequent patterns (which are usually more stable and smaller than the streaming data) and dynamically updating them with the incoming data stream. We extend the framework to mine time-sensitive patterns with an approximate support guarantee. We incrementally maintain tilted-time windows for each pattern at multiple time granularities. Interesting queries can be constructed and answered under this framework. Moreover, inspired by the fact that the FP-tree provides an effective data structure for frequent pattern mining, we develop FP-stream, an FP-tree-based data structure for maintaining time-sensitive frequency information about patterns in data streams. The FP-stream can be scanned to mine frequent patterns over multiple time granularities. An FP-stream structure consists of (a) an in-memory frequent pattern-tree to capture the frequent and sub-frequent itemset information, and (b) a tilted-time window table for each frequent pattern. Efficient algorithms for constructing, maintaining, and updating an FP-stream structure over data streams are explored. Our analysis and experiments show that it is realistic to maintain an FP-stream in data stream environments even with limited main memory.

Efficient Data-Reduction Methods for On-Line Association Rule Discovery
Herve Bronnimann, Bin Chen, Manoranjan Dash, Peter Haas, Yi Qiao, Peter Scheuermann
Abstract: Classical data mining algorithms require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present and empirically compare two data-reduction algorithms for producing such a sample; these algorithms, called FAST and EA, are tailored to count-data applications such as association-rule mining. The algorithms are similar in that both attempt to produce a sample whose distance from the complete database, appropriately defined, is minimal. They differ greatly, however, in the way they greedily search through the exponential number of possible samples. FAST, recently developed by Chen et al., uses random sampling together with trimming of outlier transactions. On the other hand, the EA algorithm, introduced in this chapter, repeatedly and deterministically halves the data to obtain the final sample. Unlike FAST, the EA algorithm provides a guaranteed level of accuracy. Our experiments show that EA is more expensive to run than FAST, but yields more accurate results for a given sample size. Thus, depending on the specific problem under consideration, the user can trade off speed and accuracy by selecting the appropriate method. We conclude by showing how the EA data-reduction approach can potentially be adapted to provide data-reduction schemes for streaming data systems. The proposed schemes favor recent data while still retaining partial information about all of the data seen so far.

Discovering Recurrent Events in Multichannel Data Streams Using Unsupervised Methods
Milind R. Naphade, Chung-Sheng Li, Thomas S. Huang
Abstract: Detection of recurrent temporal patterns in digital media is a first step in next-generation data mining for media content. Production videos such as news, sports, and movies have a definitive structure that involves short-term interaction as well as long-term correlation. This structure in video can be captured by models that take into consideration the short-term statistics as well as long-term recurrence. We investigate the application of probabilistic models that capture this structure. The novel approach is to characterize the short-term events in video by models that can account for temporal support in terms of piece-wise stationary signals with transitions. These short-term events can then be embedded within another temporal model that accounts for transitions between these events and thus characterizes long-term history. This also leads to the detection of recurring events in video using a monolithic model. The proposed approach is an unsupervised algorithm for event detection, and it can be used for summarization, similarity-based matching, and enhanced browsing. With certain extensions, similar algorithms can be used for mining temporal patterns in other domains such as bio-surveillance, where the multimodal sensor streams may be comprised of traditional and non-traditional data sources.

Counter-Terrorism, Privacy, and Data Mining

Data Mining for Counter-Terrorism
Bhavani Thuraisingham
Abstract: Data mining is becoming a useful tool for detecting and preventing terrorism. This paper first discusses some technical challenges for data mining as applied to counterterrorism. Next, it provides an overview of the various types of terrorist threats and describes how data mining techniques could provide solutions for counterterrorism. Finally, it discusses some privacy concerns and potential solutions that could detect terrorist activities and yet attempt to maintain privacy.

Biosurveillance and Outbreak Detection
Paola Sebastiani, Kenneth D. Mandl
Abstract: Faced with the very real threat of bioterrorism, the critical need for early detection of an outbreak has shortened the time frame for major enhancements to our public health infrastructure. The early detection of covert biological attacks requires real-time data streams revealing the health of the population, as well as novel methods to detect abnormalities.

MINDS - Minnesota Intrusion Detection System
Levent Ertoz, Eric Eilertson, Aleksandar Lazarevic, Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava, Paul Dokas
Abstract: This paper introduces the Minnesota Intrusion Detection System (MINDS), which uses a suite of data mining techniques to automatically detect attacks against computer networks and systems. While the long-term objective of MINDS is to address all aspects of intrusion detection, this paper focuses on two specific contributions: (i) an unsupervised anomaly detection technique that assigns a score to each network connection that reflects how anomalous the connection is, and (ii) an association pattern analysis based module that summarizes those network connections that are ranked highly anomalous by the anomaly detection module. Experimental results on live network traffic at the University of Minnesota show that our anomaly detection techniques are very promising and are successful in automatically detecting several novel intrusions that could not be identified using popular signature-based tools such as SNORT. Furthermore, given the very high volume of connections observed per unit time, association pattern based summarization of novel attacks is quite useful in enabling a security analyst to understand and characterize emerging threats.

Marshalling Evidence Through Data Mining in Support of Counter Terrorism
Daniel Barbara, James J. Nolan, David Schum, Arun Sood
Abstract: In this chapter we present an architecture to manage a large, distributed volume of evidence for counterterrorism applications. This approach facilitates the intelligence analyst’s train of thought by enabling hypotheses to be postulated and evaluated against the available evidence. The hypothesis creation is left to the human (as we believe only humans possess the flexibility to adapt hypotheses to the dynamic nature of the problem), but the system automates the gathering and linking of evidence that supports or negates the hypothesis. We do this by extensive use of data mining techniques in support of the process of query answering. The approach is incorporated into a distributed agent system that allows users to discover and compose agents for hypothesis processing.

Relational Data Mining with Inductive Logic Programming for Link Discovery
Raymond J. Mooney, Prem Melville, Lappoon Rupert Tang, Jude Shavlik, Ines de Castro Dutra, David Page, Vítor Santos Costa
Abstract: Link discovery (LD) is an important task in data mining for counter-terrorism and is the focus of DARPA’s Evidence Extraction and Link Discovery (EELD) research program. Link discovery concerns the identification of complex relational patterns that indicate potentially threatening activities in large amounts of relational data. Most data-mining methods assume data is in the form of a feature-vector (a single relational table) and cannot handle multi-relational data. Inductive logic programming is a form of relational data mining that discovers rules in first-order logic from multi-relational data. This paper discusses the application of ILP to learning patterns for link discovery.

Defining Privacy for Data Mining
Chris Clifton, Murat Kantarcoglu, Jaideep Vaidya
Abstract: Privacy-preserving data mining, that is, getting valid data mining results without learning the underlying data values, has been receiving attention in the research community and beyond. It is unclear what privacy preserving means. This paper provides a framework and metrics for discussing the meaning of privacy-preserving data mining, as a foundation for further research in this field.

Scientific Data Mining

Mining Temporally-Varying Phenomena in Scientific Datasets
R. Machiraju, S. Parthasarathy, J. Wilkins, D. Thompson, B. Gatlin, D. Richie, T. Choy, M. Jiang, S. Mehta, M. Coatney, S. Barr, K. Hazzard
Abstract: Simulation is enhancing and, in many instances, replacing experimentation as a means to gain insight into complex physical phenomena. Recent advances in computer hardware and numerical methods have made it possible to simulate physical phenomena at very fine temporal and spatial resolutions. Unfortunately, given the enormous sizes of the datasets involved, analyzing datasets produced by these simulations is extremely challenging. In order to more fully exploit simulation, the analysis of these large datasets must advance beyond current techniques that are based on interactive visualization. We outline our vision for one such approach and describe progress on a unified framework that promises to provide a novel method to explore large simulation data sets. We illustrate its application to two disparate science drivers – temporally varying solid and fluid systems. In both applications, there are hidden hierarchies of features as well as many abstract multidimensional feature characterizations (e.g. shapes). Through this framework, we offer a systematic approach to detect, characterize, and track meta-stable features as well as formulate hypotheses about their evolution – an important step in extracting vital information from such complex systems.

Methods for Mining Protein Contact Maps
Mohammed J. Zaki, Jingjing Hu, and Chris Bystroff
Abstract: The 3D conformation of a protein may be compactly represented in a symmetrical, square, boolean matrix of pairwise, inter-residue contacts, or contact map. The contact map provides a host of useful information about the protein’s structure. In this paper we describe how data mining can be used to extract valuable information from contact maps. For example, clusters of contacts represent certain secondary structures, and also capture non-local interactions, giving clues to the tertiary structure. We show that using contact maps and a hybrid mining approach, we can construct contact rules to predict the structure of an unknown protein. Furthermore, we mine non-local frequent dense contact patterns that discriminate physical from non-physical maps. We cluster these patterns based on their similarities and evaluate the clustering quality, and show that our techniques are effective in characterizing contact patterns across different proteins.

Mining Scientific Data Sets using Graphs
Michihiro Kuramochi, Mukund Deshpande, and George Karypis
Abstract: As data mining techniques are being increasingly applied to scientific and other nontraditional domains, it is becoming apparent that existing approaches for modeling the characteristics of the underlying physical phenomena and processes are limiting, as they cannot model the requirements of these domains. An alternative is to use graphs to capture, in a single unified framework, many of the spatial, topological, geometric, and other types of relational characteristics present in such datasets. The added power provided by graph modeling can only be realized if computationally efficient and scalable algorithms for many of the classical data-mining tasks, such as frequent pattern discovery, clustering, and classification, become available. In this chapter we present some of our research on developing algorithms for finding frequently occurring patterns and building predictive models for graph datasets.

Challenges in Environmental Data Warehousing and Mining
Nabil R. Adam, Vijayalakshmi Atluri, Dihua Guo, Songmei Yu
Abstract: In the area of Environmental and Earth sciences, we are concerned with the collection, assimilation, cataloging, and dissemination or retrieval of a vast array of environmental data, which include satellite imagery of different resolutions captured by different sensors, aerial photos, radar and other monitoring networks, models of the topography, spatial attributes of the landscape such as roads, rivers, parcels, schools, zip code areas, city streets and administrative boundaries, census information, digital terrain models, etc. In this paper, we describe the complex nature of the queries posed to such databases and argue that data warehousing and data mining techniques must be employed to efficiently process these queries. We then identify the research challenges in environmental data warehousing and mining. In particular, we identify research related to data model and OLAP operations, as well as the challenges in extending traditional data mining techniques to mine environmental data.

Trends in Spatial Data Mining
Shashi Shekhar, Pusheng Zhang, Yan Huang, Ranga Raju Vatsavai
Abstract: Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful patterns from large spatial datasets. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation. This chapter focuses on the unique features that distinguish spatial data mining from classical data mining. Major accomplishments and research needs in spatial data mining research are discussed.

Challenges in Scientific Data Mining: Heterogeneous, Biased, and Large Samples
Zoran Obradovic and Slobodan Vucetic
Abstract: With recent innovations in data collection technologies in many emerging scientific applications, there is a need for a paradigm shift from a traditional hypothesize-and-test process to a partial automation of hypothesis generation, model construction, and experimentation. To develop appropriate knowledge discovery tools in support of the paradigm shift in sciences, numerous data mining challenges should be solved. The next-generation data mining challenges considered in this article include (1) mining of heterogeneous data; (2) learning from biased labeled samples; (3) enhancing predictive modeling by exploiting large unlabeled databases; and (4) reducing large datasets with a controllable information loss for a more efficient mining in a distributed environment. Several effective methods for addressing these problems developed at our laboratory are described. Applications are illustrated on six scientific and engineering problems in precision agriculture, electric power systems, and biochemistry studied in our collaboration with domain experts. It is demonstrated that some fairly general data mining algorithms with appropriate modifications can be adopted for a more successful mining of spatial observations, nonstationary time series, and heterogeneous sequence databases in specific scientific applications.

Web, Semantics and Data Mining

Web Mining - Concepts, Applications & Research Directions
Jaideep Srivastava, Prasanna Desikan, Vipin Kumar
Abstract: From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident. Web mining, i.e. the application of data mining techniques to extract knowledge from Web content, structure, and usage, is the collection of technologies to fulfill this potential. Interest in Web mining has grown rapidly in its short history, both amongst researchers and practitioners. This chapter provides a brief overview of the accomplishments of the field, both in terms of technologies and applications, and outlines key future research directions.

Advancements in Text Mining Algorithms and Software
Svetlana Y. Mironova, Michael W. Berry, Scott Atchley, Micah Beck, Tianhao Wu, Lars E. Holzman, William M. Pottenger, Daniel J. Phelps
Abstract: In this chapter, we present two advancements in the development of algorithms and software for the mining of textual information. For large-scale indexing needs, we present the General Text Parser (GTP) software environment with network storage capability. This object-oriented software (C++, Java) is designed to provide information retrieval (IR) and data mining specialists the ability to parse and index large text collections. GTP utilizes Latent Semantic Indexing (or LSI) for the construction of a vector space IR model. Users can choose to store the files generated by GTP on a remote network in order to overcome local storage restrictions and facilitate file sharing. For feature extraction from textual information, we present a supervised learning algorithm for the discovery of finite state automata, in the form of regular expressions, in textual data. The automata generate languages that consist of various representations of features useful in information extraction. This learning technique has been used in the extraction of textual features from police incident reports and patent data.

On Data Mining, Semantics, and Intrusion Detection: What to dig for and Where to find it
Anupam Joshi and Jeffrey L. Undercoffer
Abstract: We examine the intersection of data mining with the semantic web, in particular with ontologies and reasoning. We show how the two can operate in synergy to improve distributed data mining, in particular Distributed Intrusion Detection. We have developed a target-centric ontological framework that models computer attacks. Our ontology is based upon an analysis of over 4,000 classes of computer attacks and their corresponding attack strategies. We are using data mining techniques to identify and specify instances of the ontology. We illustrate the utility of an ontology with an example of the Mitnick Attack, an attack that combines a denial of service and TCP hijacking in a distributed manner. We then define an Intrusion Detection System model that employs data mining techniques for initial hypothesis testing and then uses semantic reasoning to further test events that are initially identified as falling outside the bounds of normal system activity.

Usage Mining for and on the Semantic Web
Bettina Berendt, Gerd Stumme, and Andreas Hotho
Abstract: Semantic Web Mining aims at combining the two fast-developing research areas Semantic Web and Web mining. Web mining aims at discovering insights about the meaning of Web resources and their usage. Given the primarily syntactical nature of the data Web mining operates on, the discovery of meaning is impossible based on these data only. Therefore, formalizations of the semantics of Web resources and navigation behavior are increasingly being used. This fits exactly with the aims of the Semantic Web: the Semantic Web enriches the WWW by machine-processable information which supports the user in his tasks. In this paper, we discuss the interplay of the Semantic Web with Web mining, with a specific focus on usage mining.
