http://arnetminer.org/LDMTA2011
in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA.
Program -- August 21st, 2011
| 8:00 am - 8:45 am |
Invited presentation Expressing and Running Big Data Analytics Berthold Reinwald, IBM Almaden Research Center Abstract: Analytics on Big Data are ubiquitous in virtually every industry, from test analysis in manufacturing, information correlation in health sciences, churn analysis in telecommunications, and fraud analysis in finance to traffic analysis in transportation. While data analysts may choose from many platforms for analyzing small datasets, analysis of data at the scale of terabytes or petabytes is challenging. In this talk, we discuss some of the challenges in building large-scale analytical systems and describe our design principles and goals. We share our approach to implementing declarative machine learning on MapReduce, along with some of the lessons we learned. Bio: Dr. Berthold Reinwald is a Research Staff Member in the database group at IBM's Almaden Research Center. His current research areas include cloud database technology and scalable analytics platforms. Dr. Reinwald currently leads a research team on large-scale machine learning, focusing on building platforms that enable algorithms to run on large amounts of data. Dr. Reinwald holds a Ph.D. in Computer Science from the University of Erlangen-Nuernberg, Germany. |
| 8:45 am - 9:15 am |
Paper presentation Mavuno: A Scalable and Effective Hadoop-Based Paraphrase Harvesting System Donald Metzler and Eduard Hovy |
| 9:15 am - 10:00 am |
Invited presentation Large-Scale Social Data Mining Challenge in Facebook Rong Yan, Facebook Abstract: Facebook has grown into one of the largest websites in the world today, with over 700 million users. Many unique technological challenges and related applications have arisen at this scale. In this talk, I will give an overview of a number of Facebook algorithms and applications that mine these massive social data streams for different purposes, such as identifying popular trends on Facebook, detecting user happiness, and discovering latent topics of Facebook users. I will also describe other related Facebook applications in the domain of rich media, such as face detection and recognition. Bio: Dr. Rong Yan is currently a Research Scientist at Facebook. He was a Research Staff Member at the IBM T. J. Watson Research Center from 2006 to 2009. Dr. Yan received his M.Sc. (2004) and Ph.D. (2006) degrees from Carnegie Mellon University's School of Computer Science. His research interests include large-scale machine learning, data mining, ads optimization, social media, and multimedia information retrieval. Dr. Yan received Best Paper Runner-Up awards at ACM Multimedia 2004 and ACM CIVR 2007, and the IBM Research External Recognition Award in 2007. He led the technical efforts for building the face detection service at Facebook, and also designed the automatic video retrieval system that achieved the best performance in the worldwide TRECVID evaluation in 2003 and 2005. Dr. Yan has authored or co-authored 5 book chapters and more than 60 international conference and journal papers. He has served or is serving as co-chair for 10 conferences and workshops and as a Program Committee member for more than 40 ACM and IEEE conferences. He has served on the NSF proposal review panel and as a reviewer for several other research councils. Dr. Yan has given tutorials and guest lectures at several major conferences and universities. |
| 10:00 am - 10:30 am | Break |
| 10:30 am - 11:00 am |
Paper presentation Distributed Tuning of Machine Learning Algorithms using MapReduce Clusters Yasser Ganjisaffar, Thomas Debeauvais, Sara Javanmardi, Rich Caruana and Cristina Lopes |
| 11:00 am - 11:30 am |
Paper presentation Haimonti Dutta, Huascar Fiorletta, Manoj Pooleery, Hatim Diab, Stanley German, and David Waltz |
| 11:30 am - 12:00 pm |
Paper presentation Categorization of Display Ads using Image and Landing Page Features Andrew Kae, Kin Kan, Vijay Narayanan and Dragomir Yankov |
| 12:00 pm - 1:30 pm | Break |
| 1:30 pm - 2:15 pm |
Invited presentation Large Scale Estimation and Forecasting in Practice Spiros Papadimitriou and Ahmed Metwally, Google Abstract: Processing and analyzing large volumes of data provides both opportunities and challenges. In this talk we will describe practical considerations that ... Bio: Spiros Papadimitriou's main interests span data mining for graphs and streaming data, clustering, time series, systems for large-scale data processing, and mobile applications. He has published more than forty papers in refereed conferences and journals. He has several invited journal publications, talks, tutorials, and book chapters and has filed multiple patents. He was a Siebel scholarship recipient and received the SDM best paper award. He obtained his BSc in Computer Science from the University of Crete, Heraclion and his MSc and PhD degrees from Carnegie Mellon University. He is currently at Google Research. Before joining Google, he was a research staff member at IBM T.J. Watson. Ahmed Metwally's research focuses on new data management and data mining applications. Ahmed has adopted data stream mining and large-scale data analysis techniques as comprehensive approaches for understanding Internet traffic. His goal is to combat abusive Internet traffic without violating surfers' privacy. Ahmed Metwally received his M.S. and Ph.D. degrees in computer science from the University of California at Santa Barbara, and his B.S. degree in computer and systems engineering from Alexandria University. Ahmed is currently with the Ad Traffic Quality Team at Google. |
| 2:15 pm - 3:00 pm |
Invited presentation (Some) Lessons Learned in Large-scale Predictive Modeling for Ads Quality Sugato Basu, Google Abstract: The goal of Ads Quality at Google is to provide a good experience for users on Search Ads, while generating revenue for Google and ROI for advertisers. Ads Quality is an application area rich in large-scale data mining problems. This talk will focus on some of those problems, e.g., estimating ad bounce rate, predicting creative quality and landing page quality, and estimating relevance and examination probabilities of ads. Many of these problems require predictive modeling at a very large scale, often involving billions of features and millions of users. This talk will discuss a few practical lessons learned by the speaker while working on these problems at Google. Bio: Sugato is a Staff Research Scientist at Google Research. His areas of research interest include machine learning, data mining, predictive modeling, and optimization, with special emphasis on scalable algorithm design for text and social network analysis. He received his Ph.D. in machine learning from UT Austin, and worked at SRI International on the CALO project before joining Google Research. He has written multiple papers, book chapters, and encyclopedia articles on clustering, semi-supervised learning, record linkage, social search and routing, rule mining, and optimization, and has won best paper awards at the KDD, ICML, and SDM conferences. |
| 3:00 pm - 3:30 pm | Break |
| 3:30 pm - 4:15 pm |
Invited presentation Opportunities and Dangers in Large Scale Data Intensive Computing Douglas Thain, University of Notre Dame Abstract: Everyone now has easy access to thousands of machines by plugging into systems such as clusters, clouds, and grids. Unfortunately, this also means it is very easy for the unwitting to unleash a storm of computation that breaks the bank, the computer system, or both. In this talk, I will present an overview of several ways of safely organizing large scale computations, and present some of our experience in working with large scale applications in fields such as bioinformatics, data mining, and image processing. I will also present what I see as the current limitations and dangers of using such systems, in order to guide the design of future applications. Bio: Douglas Thain is an Associate Professor in the Department of Computer Science and Engineering at the University of Notre Dame, where he directs the Cooperative Computing Lab. Douglas received the B.S. in Physics from the University of Minnesota and the M.S. and Ph.D. in Computer Sciences from the University of Wisconsin, where he contributed to the Condor distributed computing system. His research team at Notre Dame creates software systems that are used around the world to attack large scale data intensive problems in science and engineering. |
| 4:15 pm - 4:45 pm |
Paper presentation MPI/OpenMP Hybrid Parallel Inference for Latent Dirichlet Allocation Shotaro Tora and Koji Eguchi |
Objectives
With advances in data collection and storage technologies, large data sources have become ubiquitous. Today, organizations routinely collect terabytes of data every day with the intent of gleaning non-trivial insights into their business processes. To benefit from these advances, it is imperative that data mining and machine learning techniques scale to such proportions. Such scaling can be achieved through the design of new and faster algorithms and/or through the use of parallelism. Furthermore, emerging and future processor architectures (such as multi-core processors) will rely on user-specified parallelism to provide any performance gains. Unfortunately, achieving such scaling is non-trivial, and only a handful of research efforts in the data mining and machine learning communities have attempted to address these scales.
At the other end of the spectrum, the past few years have witnessed the emergence of several platforms for the implementation and deployment of large-scale analytics. Examples of such platforms include Hadoop (Apache) and Dryad (Microsoft). These platforms have been developed by the large-scale distributed processing community; they not only simplify implementation but also support execution on the cloud, making large-scale machine learning and data mining both affordable and available to all. Today, there is a large gap between the data mining/machine learning community and the large-scale distributed processing community. To make advances in large-scale analytics, it is imperative that these two communities work hand in hand. The intent of this workshop is to further research efforts on large-scale data mining and to encourage researchers and practitioners to share their studies and experiences on the implementation and deployment of scalable data mining and machine learning algorithms.
Topics of Interest
Important dates and guidelines
Submission deadline: May 21st, 2011
Notification of acceptance: June 10th, 2011
Final papers due: June 15th, 2011
Submitted papers must not exceed 8 pages and must be prepared using the ACM camera‐ready template http://www.acm.org/sigs/pubs/proceed/template.html. Authors are required to submit their papers electronically in PDF format. All submissions should clearly present the author information, including the authors' names, affiliations, and email addresses.
The submission site is located at https://www.easychair.org/conferences/?conf=ldmta2011
Workshop Co-chairs
Program Committee
Steering Committee
Contact information