LDMTA2011: Large-scale Data Mining: Theory and Applications (LDMTA 2011)

http://arnetminer.org/LDMTA2011  

The 3rd Workshop on Large-scale Data Mining: Theory and Applications (LDMTA 2011)

in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA. 

Program -- August 21st, 2011

8:00 am - 8:45 am

                                    

Invited presentation

Expressing and Running Big Data Analytics

Berthold Reinwald, IBM Almaden Research Center

Abstract:

Analytics on Big Data are ubiquitous in virtually every industry, with applications such as test analysis in manufacturing, information correlation in health sciences, churn analysis in telecommunications, fraud analysis in finance, and traffic analysis in transportation. While data analysts may choose from many platforms for analyzing small datasets, analysis of data at the scale of TBytes/PBytes is challenging. In this talk, we discuss some of the challenges in building large-scale analytical systems and describe our design principles and goals. We share our approach to implementing declarative machine learning on MapReduce, along with some of our lessons learned.

Bio:

Dr. Berthold Reinwald is a Research Staff Member in the database group at IBM's Almaden Research Center. His current research areas include cloud database technology and scalable analytics platforms. Dr. Reinwald currently leads a research team on large scale machine learning focusing on building platforms to enable the implementation of algorithms to run on large amounts of data. Dr. Reinwald holds a Ph.D. in Computer Science from the University of Erlangen-Nuernberg, Germany.

8:45 am - 9:15 am

Paper presentation

Mavuno: A Scalable and Effective Hadoop-Based Paraphrase Harvesting System

Donald Metzler and Eduard Hovy

9:15 am - 10:00 am

Invited presentation

Large-Scale Social Data Mining Challenge in Facebook

Rong Yan, Facebook

Abstract:

Facebook has grown into one of the largest websites in the world today, with over 700 million users. Many unique technological challenges and related applications have arisen at this scale. In this talk, I will give an overview of a number of Facebook algorithms and applications that mine these massive social data streams for different purposes, such as identifying popular trends on Facebook, detecting user happiness, and discovering latent topics of Facebook users. Moreover, I will describe other related Facebook applications in the domain of rich media, such as face detection / recognition.

Bio:

Dr. Rong Yan is currently a Research Scientist at Facebook. He was a Research Staff Member at the IBM T. J. Watson Research Center from 2006 to 2009. Dr. Yan received his M.Sc. (2004) and Ph.D. (2006) degrees from Carnegie Mellon University's School of Computer Science. His research interests include large-scale machine learning, data mining, ads optimization, social media, and multimedia information retrieval. Dr. Yan received Best Paper Runner-Up awards at ACM Multimedia 2004 and ACM CIVR 2007, and the IBM Research External Recognition Award in 2007. He led the technical efforts for building the face detection service at Facebook, and also designed the automatic video retrieval system that achieved the best performance in the worldwide TRECVID evaluation in 2003 / 2005. Dr. Yan has authored or co-authored 5 book chapters and more than 60 international conference and journal papers. Dr. Yan has served or is serving as co-chair for 10 conferences / workshops and as a Program Committee member for more than 40 ACM / IEEE conferences. He has served on the NSF proposal review panel and as a reviewer for several other research councils. Dr. Yan has given tutorials and guest lectures at several major conferences and universities.

10:00 am - 10:30 am Break

10:30 am - 11:00 am

Paper presentation

Distributed Tuning of Machine Learning Algorithms using MapReduce Clusters

Yasser Ganjisaffar, Thomas Debeauvais, Sara Javanmardi, Rich Caruana and Cristina Lopes

11:00 am - 11:30 am

Paper presentation

A Case-Study on Learning from Large-scale Intracranial EEG Data using Multi-core Machines and Clusters

Haimonti Dutta, Huascar Fiorletta, Manoj Pooleery, Hatim Diab, Stanley German, and David Waltz

11:30 am - 12:00 pm

Paper presentation

Categorization of Display Ads using Image and Landing Page Features

Andrew Kae, Kin Kan, Vijay Narayanan and Dragomir Yankov

12:00 pm - 1:30 pm Break

1:30 pm - 2:15 pm

Invited presentation

Large Scale Estimation and Forecasting in Practice

Spiros Papadimitriou and Ahmed Metwally, Google

Abstract:

Processing and analyzing large volumes of data provides both opportunities and challenges. In this talk we will describe practical considerations that arise in the context of real-world applications, particularly in estimation and prediction. Although accuracy is important, in practice other factors such as robustness, timeliness, and ease of deployment (which are negative externalities with respect to accuracy) need to be considered as well. We will describe problems motivated by applications that need to analyze large volumes of data, and describe experiences that are not obvious, yet not surprising when viewed in the appropriate context.

Bio:

Spiros Papadimitriou's main interests span data mining for graphs and streaming data, clustering, time series, systems for large-scale data processing, and mobile applications. He has published more than forty papers in refereed conferences and journals. He has several invited journal publications, talks, tutorials, and book chapters and has filed multiple patents. He was a Siebel scholarship recipient and received the SDM best paper award. He obtained his BSc in Computer Science from the University of Crete, Heraclion and his MSc and PhD degrees from Carnegie Mellon University. He is currently at Google Research. Before joining Google, he was a research staff member at IBM T.J. Watson.

Ahmed Metwally's research focuses on new data management and data mining applications. Ahmed has adopted data stream mining and large scale data analysis techniques as comprehensive approaches for understanding Internet traffic. His goal is to combat abusive Internet traffic without violating surfers' privacy. Ahmed Metwally received his M.S. and Ph.D. degrees in computer science from the University of California at Santa Barbara, and his B.S. degree in computer and systems engineering from Alexandria University. Ahmed is currently with the Ad Traffic Quality Team at Google.

 

2:15 pm - 3:00 pm

Invited presentation

(Some) Lessons Learned in Large-scale Predictive Modeling for Ads Quality

Sugato Basu, Google

Abstract:

The goal of Ads Quality at Google is to provide a good experience for users on Search Ads, while generating revenue for Google and ROI for advertisers. Ads Quality is an application area rich in large-scale data mining problems. This talk will focus on some of those problems in Ads Quality, e.g., estimating ad bounce rate, predicting creative quality and landing page quality, and estimating relevance and examination probabilities of ads. Many of these problems require predictive modeling at a very large scale, often involving billions of features and millions of users. This talk will discuss a few practical lessons learned by the speaker while working on these problems at Google.

Bio:

Sugato is a Staff Research Scientist at Google Research. His areas of research interest include machine learning, data mining, predictive modeling and optimization, with special emphasis on scalable algorithm design for text and social network analysis. He received his Ph.D. in machine learning from UT Austin, and worked at SRI International on the CALO project before joining Google Research. He has written multiple papers, book chapters, and encyclopedia articles on clustering, semi-supervised learning, record linkage, social search and routing, rule mining, and optimization, and has won best paper awards at the KDD, ICML and SDM conferences.

 

3:00 pm - 3:30 pm Break

3:30 pm - 4:15 pm

Invited presentation

Opportunities and Dangers in Large Scale Data Intensive Computing

Douglas Thain, University of Notre Dame

Abstract:

Everyone now has easy access to thousands of machines by plugging into systems such as clusters, clouds, and grids. Unfortunately, this also means it is very easy for the unwitting to unleash a storm of computation that breaks the bank, the computer system, or both.  In this talk, I will present an overview of several ways of safely organizing large scale computations, and present some of our experience in working with large scale applications in fields such as bioinformatics, data mining, and image processing. I will also present what I see as the current limitations and dangers of using such systems, in order to guide the design of future applications.

Bio:

Douglas Thain is an Associate Professor in the Department of Computer Science and Engineering at the University of Notre Dame, where he directs the Cooperative Computing Lab.  Douglas received the B.S. in Physics from the University of Minnesota and the M.S. and Ph.D. in Computer Sciences from the University of Wisconsin, where he contributed to the Condor distributed computing system.  His research team at Notre Dame creates software systems that are used around the world to attack large scale data intensive problems in science and engineering.

4:15 pm - 4:45 pm

Paper presentation

MPI/OpenMP Hybrid Parallel Inference for Latent Dirichlet Allocation

Shotaro Tora and Koji Eguchi

 

Objectives

With advances in data collection and storage technologies, large data sources have become ubiquitous. Today, organizations routinely collect terabytes of data on a daily basis with the intent of gleaning non-trivial insights on their business processes. To benefit from these advances, it is imperative that data mining and machine learning techniques scale to such proportions. Such scaling can be achieved through the design of new and faster algorithms and/or through the employment of parallelism. Furthermore, it is important to note that emerging and future processor architectures (like multi-cores) will rely on user-specified parallelism to provide any performance gains. Unfortunately, achieving such scaling is non-trivial and only a handful of research efforts in the data mining and machine learning communities have attempted to address these scales. 

At the other end of the spectrum, the past few years have witnessed the emergence of several platforms for the implementation and deployment of large-scale analytics. Examples of such platforms include Hadoop (Apache) and Dryad (Microsoft). These platforms have been developed by the large-scale distributed processing community and can not only simplify implementation but also support execution on the cloud, making large-scale machine learning and data mining both affordable and available to all. Today, there is a large gap between the data mining/machine learning and the large-scale distributed processing communities. To make advances in large-scale analytics, it is imperative that both these communities work hand-in-hand. The intent of this workshop is to further research efforts on large-scale data mining and to encourage researchers and practitioners to share their studies and experiences on the implementation and deployment of scalable data mining and machine learning algorithms.


Topics of Interest

  • Application case studies that showcase the need for large-scale machine learning/data mining. Areas of interest include financial modeling, web mining, medical informatics, climate modeling, and mining retail and e-commerce data.
  • Parallel and distributed algorithms for large scale machine learning/data mining, data preprocessing, and cleaning.
  • Exploiting modern and specialized hardware such as multi-core processors, GPUs, STI Cell processor, etc.
  • Memory hierarchy aware data mining/machine learning algorithms.
  • Streaming data algorithms for machine learning and data mining.
  • New platforms and/or programming model proposals for parallel/distributed machine learning and data mining for batch and/or stream domains.
  • Evaluation of platforms (such as Hadoop) and/or programming models (such as map-reduce) for batch and/or stream domains.
  • Performance studies comparing cloud, grid, and cluster implementations.
  • Data intensive computing approaches.
  • Future research challenges in cloud and data intensive computing.

Important dates and guidelines

 

Submission deadline: May 21st, 2011

Notification of acceptance: June 10th, 2011

Final papers due: June 15th, 2011

 

All submitted papers should have a maximum length of 8 pages and must be prepared using the ACM camera-ready template (http://www.acm.org/sigs/pubs/proceed/template.html). Authors are required to submit their papers electronically in PDF format. All submissions should clearly present the author information, including the names, affiliations, and email addresses of the authors.

The submission site is located at https://www.easychair.org/conferences/?conf=ldmta2011

Workshop Co-chairs

  • Dr. Chid Apte, IBM Research, apte (at) us.ibm.com
  • Prof. Nitesh Chawla, University of Notre Dame, nchawla (at) cse.nd.edu
  • Dr. Amol Ghoting, IBM Research, aghoting (at) us.ibm.com
  • Prof. Yan Liu, University of Southern California, yanliu.cs (at) usc.edu
  • Dr. Jimeng Sun, IBM Research, jimeng (at) us.ibm.com
  • Prof. Jie Tang, Tsinghua University, China, jietang (at) tsinghua.edu.cn
  • Dr. Ranga Raju Vatsavai, Oak Ridge National Laboratory, vatsavairr (at) ornl.gov

 

Program Committee

  • Shirish Tatikonda, IBM Research
  • Gagan Agrawal, Ohio State University
  • Jeffrey Yu, Chinese University of Hong Kong
  • Alexander Gray, Georgia Tech
  • Prabhanjan Kambadur, IBM Research
  • Rong Yan, Facebook
  • Elad Yom-Tov, Yahoo! Research
  • Mohammed Zaki, Rensselaer Polytechnic Institute
  • Saeed Salem, North Dakota State University
  • Berthold Reinwald, IBM Research
  • Yuan Yu, Microsoft Research
  • Petros Drineas, Rensselaer Polytechnic Institute
  • Misha Bilenko, Microsoft Research
  • Ron Bekkerman, LinkedIn
  • Vijay Narayanan, Yahoo!
  • Milind Bhandarkar, LinkedIn
  • Tina Eliassi-Rad, Rutgers University

 

Steering Committee

  • Prof. Christos Faloutsos, Carnegie Mellon University
  • Prof. Robert Grossman, University of Chicago
  • Prof. Jiawei Han, University of Illinois at Urbana-Champaign

 

Contact information

  • Dr. Amol Ghoting, IBM Research, aghoting (at) us.ibm.com, 1-914-945-2193
