http://arnetminer.org/LDMTA2009
In conjunction with ICDM 2009, December 6-9, 2009, Miami, FL, USA
Theme and Topics
Driven by the rapid growth of data of all kinds, demand is increasing for scalable machine learning and data mining algorithms in many applications, such as social network analysis, information retrieval, recommendation systems, biology, multimedia, and e-commerce. Graph-based algorithms and graph mining were among the first approaches to successfully address scalability in large-scale applications. Nevertheless, it is not clear how the majority of machine learning and data mining approaches can be adapted to handle the problem in a systematic way. In particular, advanced techniques such as graphical models, sparse optimization, and Bayesian approaches are still far from practical for scenarios with millions of examples. In this workshop, we are interested in investigating the scalability and efficiency of existing machine learning and data mining algorithms from both theoretical and experimental perspectives. We seek papers on the following topics:
Invited Speakers
Christos Faloutsos, Professor
Carnegie Mellon University
Title: Large Graph Mining
Abstract: What do graphs look like? How do they evolve over time? How can we generate realistic-looking graphs? What properties do we see in large graphs that are not visible in smaller graphs? We review some static and temporal 'laws', and we describe some recent generators which naturally match all of the known properties of real graphs. Moreover, we present tools for discovering anomalies and patterns in two types of graphs, static and time-evolving. We show how to use Hadoop to mine large graphs, and we report results from a web snapshot of over 100Gb in size, using Yahoo's M45 cluster. M45 is one of the top 50 supercomputers, with 4,000 processors, three terabytes of memory, and 1.5 petabytes of disk.
Biography: Christos Faloutsos is a Professor at Carnegie Mellon University. He has received the Presidential Young Investigator Award from the National Science Foundation (1989), the Research Contributions Award at ICDM 2006, fifteen "best paper" awards, and several teaching awards. He has served as a member of the executive committee of SIGKDD and has published over 170 refereed articles, 11 book chapters, and one monograph. He holds five patents and has given over 20 tutorials and over 10 invited distinguished lectures. His research interests include data mining for streams and networks, fractals, indexing for multimedia and bioinformatics data, and database performance.
Tamara G. Kolda, Principal Member of Technical Staff
Sandia National Laboratories
Title: Scalable Tensor Factorizations with Incomplete Data
Abstract: The problem of incomplete data (i.e., data with missing or unknown values) in multi-way arrays is ubiquitous in biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, communication networks, etc. We consider the problem of how to factorize data sets with missing values with the goal of capturing the underlying latent structure of the data and possibly reconstructing missing values (i.e., tensor completion). We focus on one of the most well-known tensor factorizations that captures multi-linear structure, CANDECOMP/PARAFAC (CP). In the presence of missing data, CP can be formulated as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) that uses a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factorize tensors with noise and up to 99% missing data. A unique aspect of our approach is that it scales to sparse large-scale data, e.g., 1000 x 1000 x 1000 with one million known entries (0.1% dense). To show the real-world usefulness of CP-WOPT, we illustrate its applicability to a novel EEG (electroencephalogram) application, where missing data is frequently encountered due to disconnections of electrodes, and to the problem of modeling network traffic, where data may be absent due to collection errors. This is joint work with Evrim Acar, Daniel M. Dunlavy, and Morten Morup.
Biography: Dr. Tamara Kolda is a Principal Member of Technical Staff in the Mathematics, Informatics, and Decision Sciences department at Sandia National Laboratories in Livermore, California. She has received several awards, including a 2003 Presidential Early Career Award for Scientists and Engineers (PECASE). Dr. Kolda is well known in multilinear algebra for her work on the Tensor Toolbox for MATLAB and the recent publication of the article "Tensor Decompositions and Applications" in SIAM Review. She also co-authored a paper on memory-efficient Tucker (MET) tensor decompositions that won the Best Paper Prize in the Theoretical/Algorithms Category at the 2008 IEEE International Conference on Data Mining (ICDM'08). Dr. Kolda is an associate editor for the SIAM Journal on Scientific Computing and chair of the SIAM Activity Group on Computational Science and Engineering.
Important Date
Submissions
Authors of the best papers will be invited to submit extended versions to the TKDD special issue on Large-scale Data Mining: Theory and Applications. Papers must be no more than 10 pages long and follow the IEEE camera-ready template: http://wi-lab.com/cyberchair/icdm09/scripts/submit.php.
All papers must be submitted in Adobe Portable Document Format (PDF). Please ensure that any special fonts used are embedded in the submitted document. If you cannot submit through the online submission system, please email your paper to liuya@us.ibm.com.
Workshop Co-Chairs
Program Committee
Contact us
Yan Liu, IBM TJ Watson Research Center, liuya@us.ibm.com, 1-914-945-2128
Program
9:00-9:15am Opening
9:15-10:00am Keynote 1
Christos Faloutsos, Professor
Carnegie Mellon University
Title: Large Graph Mining
10:05-10:30am Break
10:30am-12:30pm Technical Presentation
Regular presentation
- Wenxuan Gao, Robert Grossman, Philip Yu, and Yunhong Gu, "Why naive ensemble does not work in cloud computing"
- Ying Chen, Scott Spangler, Jeffrey Kreulen, Stephen Boyer, Thomas Griffin, Alfredo Alba, Amit Behal, Ana Lelescu, Bin He, Linda Kato, Cheryl Kieliszewski, Xian Wu, and Li Zhang, "SIMPLE: A Strategic Information Mining PLatform for IP Excellence"
- Yuk Wah Wong, Dominic Widdows, Tom Lokovic, and Kamal Nigam, "Scalable Attribute-Value Extraction from Semi-Structured Text"
Short presentation
- Shengqi Yang, Bin Wu, Haizhou Zhao, Qi Ye, and Bai Wang, "Efficient Dense Structure Mining using MapReduce"
- Ramesh Natarajan, Vikas Sindhwani, and Shirish Tatikonda, "Parallel, Sparse Least Squares Methods in Document Classification"
- Peerapon Vateekul and Miroslav Kubat, "Fast Induction of Multiple Decision Trees in Text Categorization From Large Scale, Imbalanced, and Multi-label Data"
- Zuobing Xu, Chris Hogan, and Robert Bauer, "Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm"
- Daniel Gillblad, Diogo Ferreira, and Rebecca Steinert, "Estimating the Parameters of Randomly Interleaved Markov Models"
12:30-1:30pm Lunch
1:30-2:15pm Keynote 2
Tamara G. Kolda, Principal Member of Technical Staff
Sandia National Laboratories
Title: Scalable Tensor Factorizations with Incomplete Data
2:15-4:00pm Technical Presentation
Regular presentation
- Tianyuan Chen, Lei Chang, Jianqing Ma, Wei Zhang, and Feng Gao, "HOCT: A Highly Scalable Algorithm for Training Linear CRF on Modern Hardware"
- Evrim Acar, Tamara Kolda, and Daniel Dunlavy, "Link Prediction on Evolving Data using Matrix and Tensor Factorizations"
- B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, and Christos Faloutsos, "EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs"
- Dennis Wegener, Michael Mock, Deyaa Adranale, and Stefan Wrobel, "Toolkit-based high-performance Data Mining of large Data on MapReduce Clusters"
Short presentation
- Bin Zhao and Changshui Zhang, "Compressed Spectral Clustering"
- Jia-Ching Ying, Vincent S. Tseng, and Philip Yu, "Efficient Incremental Mining of Qualified Web Traversal Patterns without Scanning Original Databases"
4:00-4:30pm Break
4:30-5:45pm Panel
5:45-6:00pm Concluding remarks