Parallelization of Data Mining Algorithms
As data volumes grow and decision making becomes increasingly data-centric, it is important to design data mining algorithms that run efficiently on HPC architectures such as distributed-memory, shared-memory, and hybrid systems. We have designed, and continue to design, parallel versions of various data clustering algorithms for these architectures, including density-based clustering (DBSCAN, OPTICS), hierarchical clustering (SLINK), subspace clustering (ENCLUS, MAFIA, PROCLUS), and shared nearest neighbors (SNN). The proposed algorithms have been found to outperform state-of-the-art techniques.
Data Distribution Strategies for Parallel Data Mining Algorithms
To get the best performance from parallel data mining algorithms on distributed-memory architectures, an efficient data distribution scheme is essential for balancing the computational load across nodes. The scheme has to be specific to the kind of algorithm being executed. We have designed, and continue to design, such schemes for distributing large datasets over a cluster of computing nodes so that the load is balanced as evenly as possible, maximizing the performance of various spatial data mining algorithms.
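As a rough illustration of spatial load balancing, the sketch below splits a 2-D point set into near-equal, spatially compact groups by recursive median cuts along alternating axes. It is a generic textbook-style scheme, not one of the algorithm-specific schemes described above; the function name `median_split` is hypothetical, and it assumes that the number of points per node is a reasonable proxy for computational load, which often does not hold for density-based algorithms.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def median_split(points: List[Point], parts: int, depth: int = 0) -> List[List[Point]]:
    """Split `points` into `parts` spatially compact groups of near-equal size.

    Assumption: point count approximates the work a node has to do.
    """
    if parts == 1:
        return [points]
    axis = depth % 2  # alternate between x and y cuts
    ordered = sorted(points, key=lambda p: p[axis])
    left_parts = parts // 2
    # Cut proportionally so that an odd number of parts stays balanced.
    cut = len(ordered) * left_parts // parts
    return (median_split(ordered[:cut], left_parts, depth + 1)
            + median_split(ordered[cut:], parts - left_parts, depth + 1))
```

Each returned group would be sent to one computing node. A scheme tuned to a particular algorithm would likely replace the point count with an estimate of per-point work, for example the expected neighborhood size.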
Anytime Mining of Data Streams
With the growing deployment of data-generating devices, there is a need for stream mining algorithms that can process streams whose items arrive at varying inter-arrival times. At the same time, these algorithms should be capable of processing multiple streams by leveraging a high performance computing architecture. We have developed, and continue to develop, anytime stream mining algorithms for various data mining tasks such as clustering, classification, and frequent itemset mining. These algorithms not only handle variable stream speeds, but can also produce an immediate approximate mining result whenever the user requests one, and improve the quality of that result as more time is allowed.
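The anytime idea can be illustrated with a minimal sketch, assuming a one-dimensional stream and a sequential k-means update. The class `AnytimeKMeans1D` and its methods are hypothetical and are not the algorithms developed by the group. A step budget stands in for the time allowance: arriving items are buffered, `refine` absorbs as many as the budget permits, and `result` can be called at any moment to get the current approximate centroids.

```python
from collections import deque
from typing import Deque, List


class AnytimeKMeans1D:
    """Sequential k-means over a buffered 1-D stream (illustrative sketch)."""

    def __init__(self, initial_centroids: List[float]) -> None:
        self.centroids = list(initial_centroids)
        # Each initial centroid is treated as one already-seen observation.
        self.counts = [1] * len(self.centroids)
        self.buffer: Deque[float] = deque()

    def push(self, x: float) -> None:
        """Buffer an arriving item; cheap enough to keep up with fast streams."""
        self.buffer.append(x)

    def refine(self, budget: int) -> int:
        """Absorb up to `budget` buffered items; return how many were processed."""
        done = 0
        while self.buffer and done < budget:
            x = self.buffer.popleft()
            j = min(range(len(self.centroids)),
                    key=lambda i: abs(x - self.centroids[i]))
            self.counts[j] += 1
            # Incremental mean update of the nearest centroid.
            self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
            done += 1
        return done

    def result(self) -> List[float]:
        """Current approximate centroids, available at any time."""
        return list(self.centroids)
```

When the stream is fast, `refine` is called with a small budget and the result reflects only part of the buffer; when the stream slows down, a larger budget lets the model catch up and the result improves.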
Data Structures for Data Mining
We have developed, and continue to develop, tailor-made indexing structures that speed up spatial queries such as neighborhood and nearest-neighbor queries, which are commonly used in spatial data mining algorithms like DBSCAN, OPTICS, SNN, SLINK, and the K-NN classifier. The proposed data structures give better query performance than conventional data structures such as the R-tree and the kd-tree.
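To make the kind of query concrete, here is a minimal sketch of an epsilon-neighborhood query (the query at the core of DBSCAN) answered with a plain uniform grid. This is a conventional baseline used only for illustration, not one of the tailor-made structures mentioned above; the class name `GridIndex` is hypothetical and the points are assumed to be 2-D.

```python
from collections import defaultdict
from math import floor, hypot
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]


class GridIndex:
    """Uniform grid with cell side eps for eps-neighborhood queries."""

    def __init__(self, points: Iterable[Point], eps: float) -> None:
        self.eps = eps
        self.points: List[Point] = list(points)
        self.cells: Dict[Tuple[int, int], List[int]] = defaultdict(list)
        for i, (x, y) in enumerate(self.points):
            self.cells[self._cell(x, y)].append(i)

    def _cell(self, x: float, y: float) -> Tuple[int, int]:
        return (floor(x / self.eps), floor(y / self.eps))

    def neighbors(self, q: Point) -> List[int]:
        """Indices of all points within distance eps of q (inclusive)."""
        cx, cy = self._cell(*q)
        found = []
        # Any point within eps of q lies in q's cell or one of its 8 neighbors.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for i in self.cells.get((cx + dx, cy + dy), ()):
                    px, py = self.points[i]
                    if hypot(px - q[0], py - q[1]) <= self.eps:
                        found.append(i)
        return sorted(found)
```

The grid only inspects nine cells per query instead of the whole dataset, but it degrades on skewed data, which is one reason tree-based and specialized indexes are studied.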
Domain Specific Language and Compiler for Parallel Data Mining Algorithms
We have developed a domain-specific language (DSL) known as DWARF, designed specifically for data clustering algorithms. It provides language constructs for efficient design and rapid prototyping of various clustering algorithms, including density-based clustering (DBSCAN, OPTICS, RECOME), subspace clustering (ENCLUS, MAFIA, PROCLUS, etc.), partitioning-based clustering (K-means, EM clustering, K-medoids, etc.), and hierarchical clustering (SLINK, CLINK, ALINK). Along with the language, we have designed a compiler that automatically parallelizes sequential code written in DWARF to run on HPC architectures such as distributed-memory and shared-memory systems. The parallel code generated by the compiler performs on par with state-of-the-art parallel algorithms. We are currently developing a DSL and a compiler for classification algorithms. We are also working on a virtual machine to make DWARF platform independent.
Social Media Analytics
With the growing use of social media platforms such as Twitter and Facebook, analyzing social media data is becoming increasingly popular. We are working on Twitter streams, specifically on problems such as data visualization and event detection.
Genome Sequence Assembly
We are also working on the genome assembly problem, which refers to aligning and merging fragments of a longer DNA sequence in order to reconstruct the original sequence. We are building efficient methods for this task that leverage the performance gains offered by HPC architectures.
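The merging step can be illustrated with a small greedy overlap sketch: repeatedly find the pair of fragments with the longest suffix-prefix overlap and merge them. This is a classic heuristic shown only to make the problem concrete, not the group's method. The function names and the `min_len` threshold are hypothetical, and the sketch assumes error-free reads from a single strand, ignoring reverse complements and repeats.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: List[str], min_len: int = 3) -> List[str]:
    """Merge fragments by largest overlap until no overlap >= min_len remains."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining fragments do not overlap; return them as contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

For example, the fragments `ATGCGTA`, `CGTACGT`, and `ACGTTAG` are merged into the single contig `ATGCGTACGTTAG`. The all-pairs overlap search is quadratic in the number of fragments, which is the part that parallel architectures are typically used to accelerate.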