pair index:0 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:15 citee title:on the design and quantification of privacy preserving data mining algorithms citee abstract:the increasing ability to track and collect large amounts of data with the use of current hardware technology has led to an interest in the development of data mining algorithms which preserve user privacy. a recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. this method is able to retain privacy while accessing the information implicit in the original attributes.... surrounding text:therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. this has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years [1]<2>, [***]<2>, [3]<2>, [4]<2>, [6]<2>, [8]<2>, [9]<2>, [12]<2>, [13]<2>. a perturbation based approach to privacy preserving data mining was pioneered in [1]<1>. an iterative algorithm has been proposed in the same work in order to estimate the data distribution fX. a convergence result was proved in [***]<2> for a refinement of this algorithm. in addition, the paper in [***]<2> provides a framework for effective quantification of the effectiveness of a (perturbation-based) privacy preserving data mining approach. a convergence result was proved in [***]<2> for a refinement of this algorithm. in addition, the paper in [***]<2> provides a framework for effective quantification of the effectiveness of a (perturbation-based) privacy preserving data mining approach. we note that the perturbation approach results in some amount of information loss influence:1 type:2 pair index:1 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications.
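to make the perturbation-and-reconstruction idea referenced in the surrounding text above concrete, here is a minimal sketch in python. it shows only the additive-noise step and a simple moment-based estimate of the original variance; it is not the iterative distribution-reconstruction algorithm of [1] or its refinement, and the data, noise level, and variable names are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)

def perturb(values, noise_std):
    # release each value with independent zero-mean gaussian noise added
    return values + rng.normal(loc=0.0, scale=noise_std, size=len(values))

ages = rng.normal(loc=40.0, scale=10.0, size=10_000)   # hypothetical sensitive attribute
noise_std = 20.0
released = perturb(ages, noise_std)

# the miner sees only `released` and the noise distribution; aggregate statistics
# of the original data can still be estimated, for example mean and variance:
est_mean = released.mean()                 # E[X + N] = E[X] since E[N] = 0
est_var = released.var() - noise_std ** 2  # Var[X] = Var[X + N] - Var[N]
print(est_mean, est_var)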
in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:16 citee title:security and privacy implications of data mining citee abstract:data mining enables us to discover information we do not expect to find in databases. this can be a security/privacy issue: if we make information available, are we perhaps giving out more than we bargained for? this position paper discusses possible problems and solutions, and outlines ideas for further research in this area. 1 introduction: database technology provides a number of advantages. data mining is one of these; using automated tools to analyze corporate data can help find ways to surrounding text:therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. this has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years [1]<2>, [2]<2>, [3]<2>, [***]<2>, [6]<2>, [8]<2>, [9]<2>, [12]<2>, [13]<2>. a perturbation based approach to privacy preserving data mining was pioneered in [1]<1> influence:1 type:2 pair index:2 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set.
this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:17 citee title:privacy preserving association rule mining in vertically partitioned data citee abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection surrounding text:therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. this has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years [1]<2>, [2]<2>, [3]<2>, [4]<2>, [***]<2>, [8]<2>, [9]<2>, [12]<2>, [13]<2>. a perturbation based approach to privacy preserving data mining was pioneered in [1]<1> influence:1 type:2 pair index:3 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:19 citee title:data swapping: balancing privacy against precision in mining for logic rules citee abstract:the recent proliferation of data mining tools for the analysis of large volumes of data has paid little attention to individual privacy issues. here, we introduce methods aimed at finding a balance between the individuals' right to privacy and the data-miners' need to find general patterns in huge volumes of detailed records. in particular, we focus on the data-mining task of classification with decision trees.
we base our security-control mechanism on noise-addition techniques used in statistical databases because (1) the multidimensional matrix model of statistical databases and the multidimensional cubes of on-line analytical processing (olap) are essentially the same, and (2) noise-addition techniques are very robust. the main drawback of noise addition techniques in the context of statistical databases is the low statistical quality of released statistics. we argue that in data mining the major requirement of a security-control mechanism (in addition to protecting privacy) is not to ensure precise and bias-free statistics, but rather to preserve the high-level descriptions of knowledge constructed by artificial data mining tools. surrounding text:therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. this has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years [1]<2>, [2]<2>, [3]<2>, [4]<2>, [6]<2>, [***]<2>, [9]<2>, [12]<2>, [13]<2>. a perturbation based approach to privacy preserving data mining was pioneered in [1]<1> influence:1 type:2 pair index:4 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:20 citee title:privacy preserving mining of association rules citee abstract:we present a framework for mining association rules from transactions consisting of categorical items where the data has been randomized to preserve privacy of individual transactions. while it is feasible to recover association rules and preserve privacy using a straightforward "uniform" randomization, the discovered rules can unfortunately be exploited to find privacy breaches. we analyze the nature of privacy breaches and propose a class of randomization operators that are much more effective than uniform randomization in limiting the breaches.
we derive formulae for an unbiased support estimator and its variance, which allow us to recover itemset supports from randomized datasets, and show how to incorporate these formulae into mining algorithms. finally, we present experimental results that validate the algorithm by applying it to real datasets. surrounding text:therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. this has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years [1]<2>, [2]<2>, [3]<2>, [4]<2>, [6]<2>, [8]<2>, [***]<2>, [12]<2>, [13]<2>. a perturbation based approach to privacy preserving data mining was pioneered in [1]<1>. this means that for each individual data problem such as classification, clustering, or association rule mining, a new distribution based data mining algorithm needs to be developed. for example, the work in [1]<1> develops a new distribution based data mining algorithm for the classification problem, whereas the techniques in [***]<2> and [16]<2> develop methods for privacy preserving association rule mining. while some clever approaches have been developed for distribution based mining of data for particular problems such as association rules and classification, it is clear that using distributions instead of original records greatly restricts the range of algorithmic techniques that can be used on the data influence:1 type:2 pair index:5 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions.
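the "unbiased support estimator" mentioned in the abstract above can be illustrated with a much simpler randomization than the one used in that paper: assume each boolean item value is kept with probability p and flipped with probability 1-p (symmetric randomized response). under that assumption the observed support relates linearly to the true support, and inverting the relation gives an unbiased estimate. the sketch below covers only this simplified single-item case; the cited "select-a-size" operators and the itemset-level formulae are more involved.

import numpy as np

rng = np.random.default_rng(1)
p = 0.8                                   # retention probability, known to the miner
true_bits = rng.random(100_000) < 0.25    # hypothetical item with true support 0.25

flip = rng.random(true_bits.size) >= p    # flip each bit with probability 1 - p
randomized = np.where(flip, ~true_bits, true_bits)

# E[s_obs] = s*p + (1 - s)*(1 - p), so an unbiased estimator of the true support s is:
s_obs = randomized.mean()
s_hat = (s_obs - (1.0 - p)) / (2.0 * p - 1.0)
print(s_obs, s_hat)                       # s_hat should be close to 0.25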
we present empirical results illustrating the effectiveness of the method citee id:21 citee title:an efficient approach to clustering in large multimedia databases with noise citee abstract: several clustering algorithms can be applied to clustering in large multimedia databases. the effectiveness and efficiency of the existing algorithms, however, is somewhat limited, since clustering in multimedia databases requires clustering high-dimensional feature vectors and since multimedia databases often contain large amounts of noise. in this paper, we therefore introduce a new algorithm for clustering in large multimedia databases called denclue (density-based clustering). the basic idea of our new approach is to model the overall point density analytically as the sum of influence functions of the data points. clusters can then be identified by determining density-attractors, and clusters of arbitrary shape can be easily described by a simple equation based on the overall density function. the advantages of our new approach are (1) it has a firm mathematical basis, (2) it has good clustering properties in data sets with large amounts of noise, (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets and (4) it is significantly faster than existing algorithms. to demonstrate the effectiveness and efficiency of denclue, we perform a series of experiments on a number of different data sets from cad and molecular biology. a comparison with dbscan shows the superiority of our new approach. keywords: clustering algorithms, density-based clustering, clustering of high-dimensional data, clustering in multimedia databases, clustering in the presence of noise 1 introduction because of the fast technological progress, the amount of data which is stored in databases increases very fast. the types of data which are stored in the computer become increasingly complex. in addition to numerical data, complex 2d and 3d multimedia data such as image, cad, geographic, and molecular biology data are stored in databases. for an efficient retrieval, the complex data is usually transformed into high-dimensional feature vectors. examples of feature vectors are color histograms, shape descriptors, fourier vectors, text descriptors, etc. in many of the mentioned applications, the databases are very large and consist of millions of data objects with several tens to a few hundreds of dimensions. automated knowledge discovery in large multimedia databases is an increasingly important research issue. clustering and trend detection in such databases, however, is difficult since the databases often contain large amounts of noise and sometimes only a small portion of the large databases accounts for the clustering. in addition, most of the known algorithms do not work efficiently on high-dimensional data. the methods which surrounding text:this results in a more robust classification model. we note that the effect of anomalies in the data is also observed for a number of other data mining problems such as clustering [***]<2>. while this paper studies classification as one example, it would be interesting to study other data mining problems as well influence:3 type:3 pair index:6 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications.
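the denclue abstract above models overall point density as a sum of influence functions. the sketch below shows only that density computation with a gaussian influence function on made-up two-dimensional data (the bandwidth sigma and the data are hypothetical); the density-attractor / hill-climbing step that actually forms the clusters is omitted.

import numpy as np

def density(query, points, sigma=1.0):
    # overall density at `query` = sum of gaussian influences of all data points
    sq_dists = np.sum((points - query) ** 2, axis=1)
    return float(np.sum(np.exp(-sq_dists / (2.0 * sigma ** 2))))

rng = np.random.default_rng(2)
points = np.vstack([rng.normal(0.0, 1.0, (200, 2)),   # one dense region around (0, 0)
                    rng.normal(6.0, 1.0, (200, 2))])  # another around (6, 6)
print(density(np.array([0.0, 0.0]), points))          # high density: near an attractor
print(density(np.array([3.0, 3.0]), points))          # low density: between the clusters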
in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:23 citee title:a data distortion by probability distribution citee abstract:this paper introduces data distortion by probability distribution, a probability distortion that involves three steps. the first step is to identify the underlying density function of the original series and to estimate the parameters of this density function. the second step is to generate a series of data from the estimated density function. and the final step is to map and replace the generated series for the original one. because it is replaced by the distorted data set, probability distortion guards the privacy of an individual belonging to the original data set. at the same time, the probability distorted series provides asymptotically the same statistical properties as those of the original series, since both are under the same distribution. unlike conventional point distortion, probability distortion is difficult to compromise by repeated queries, and provides a maximum exposure for statistical analysis. surrounding text:therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. this has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years [1]<2>, [2]<2>, [3]<2>, [4]<2>, [6]<2>, [8]<2>, [9]<2>, [***]<2>, [13]<2>. a perturbation based approach to privacy preserving data mining was pioneered in [1]<1> influence:2 type:2 pair index:7 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions.
in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:24 citee title:privacy interfaces for information management citee abstract:to facilitate the sharing of information using modern communication networks, users must be able to decide on a privacy policy---what information to conceal, what to reveal, and to whom. we describe the evolution of privacy interfaces---the user interfaces for specifying privacy policies---in collabclio, a system for sharing web browsing histories. our experience has shown us that privacy policies ought to be treated as first-class objects: policy objects should have an intensional surrounding text:therefore, in order to ensure effective data collection, it is important to design methods which can mine the data with a guarantee of privacy. this has resulted in a considerable amount of focus on privacy preserving data collection and mining methods in recent years [1]<2>, [2]<2>, [3]<2>, [4]<2>, [6]<2>, [8]<2>, [9]<2>, [12]<2>, [***]<2>. a perturbation based approach to privacy preserving data mining was pioneered in [1]<1> influence:2 type:2 pair index:8 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:26 citee title:maintaining data privacy in association rule mining citee abstract:data mining services require accurate input data for their results to be meaningful, but privacy concerns may influence users to provide spurious information.
we investigate here, with respect to mining association rules, whether users can be encouraged to provide correct information by ensuring that the mining process cannot, with any reasonable degree of certainty, violate their privacy. we present a scheme, based on probabilistic distortion of user data, that can simultaneously provide a high degree of privacy to the user and retain a high level of accuracy in the mining results. the performance of the scheme is validated against representative real and synthetic datasets. surrounding text:this means that for each individual data problem such as classification, clustering, or association rule mining, a new distribution based data mining algorithm needs to be developed. for example, the work in [1]<1> develops a new distribution based data mining algorithm for the classification problem, whereas the techniques in [9]<2> and [***]<2> develop methods for privacy preserving association rule mining. while some clever approaches have been developed for distribution based mining of data for particular problems such as association rules and classification, it is clear that using distributions instead of original records greatly restricts the range of algorithmic techniques that can be used on the data influence:2 type:2 pair index:9 citer id:14 citer title:a condensation approach to privacy preserving data mining citer abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method citee id:27 citee title:protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression citee abstract:today's globally networked society places great demand on the dissemination and sharing of person-specific data. situations where aggregate statistical information was once the reporting norm now rely heavily on the transfer of microscopically detailed transaction and encounter information. this happens at a time when more and more historically public information is also electronically available. when these data are linked together, they provide an electronic shadow of a person or organization surrounding text:thus, there is a natural trade-off between greater accuracy and loss of privacy.
another interesting method for privacy preserving data mining is the k-anonymity model [***]<2>. in the k-anonymity model, domain generalization hierarchies are used in order to transform and replace each record value with a corresponding generalized value. (2) it can be effectively used in situations with dynamic data updates such as the data stream problem. this is not the case for the work in [***]<2>, which essentially assumes that the entire data set is available a priori. this paper is organized as follows influence:1 type:2 pair index:10 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:15 citee title:on the design and quantification of privacy preserving data mining algorithms citee abstract:the increasing ability to track and collect large amounts of data with the use of current hardware technology has led to an interest in the development of data mining algorithms which preserve user privacy. a recently proposed technique addresses the issue of privacy preservation by perturbing the data and reconstructing distributions at an aggregate level in order to perform the mining. this method is able to retain privacy while accessing the information implicit in the original attributes.... surrounding text:in [4]<2>, data perturbation techniques are used to protect individual privacy for classification, by adding random values from a normal/gaussian distribution of mean 0 to the actual data values. one problem with this approach is a tradeoff between privacy and the accuracy of the results [***]<2>. more recently, data perturbation has been applied to boolean association rules [18]<2> influence:2 type:2 pair index:11 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values.
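the k-anonymity model discussed at the top of this record replaces quasi-identifier values with coarser values drawn from domain generalization hierarchies until every value combination occurs at least k times. the sketch below is a minimal, single-global-level illustration of that idea; the zip-code truncation and age-bucket hierarchies, the toy records, and the choice of k are all hypothetical, and suppression (also used in the cited work) is omitted.

from collections import Counter

def zip_hierarchy(z, level):                 # "02139" -> "0213*" -> "021**" -> ...
    level = min(level, 5)
    return z[:5 - level] + "*" * level

def age_hierarchy(a, level):                 # 34 -> "30-39" -> "*"
    if level == 0:
        return str(a)
    if level == 1:
        lo = (a // 10) * 10
        return f"{lo}-{lo + 9}"
    return "*"

def is_k_anonymous(records, k, level):
    generalized = [(zip_hierarchy(z, level), age_hierarchy(a, level)) for z, a in records]
    return min(Counter(generalized).values()) >= k

records = [("02139", 34), ("02139", 36), ("02141", 35), ("02142", 61)]  # toy quasi-identifiers
level = 0
while not is_k_anonymous(records, k=2, level=level):
    level += 1                               # climb the hierarchies until 2-anonymity holds
print(level)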
categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:733 citee title:mining association rules between sets of items in large databases citee abstract:we are given a large database of customer transactions. each transaction consists of items purchased by a customer in a visit. we present an efficient algorithm that generates all significant association rules between items in the database. the algorithm incorporates buffer management and novel estimation and pruning techniques. we also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm. surrounding text:problem definition we consider the heterogeneous database scenario considered in [7]<1>, a vertical partitioning of the database between two parties a and b. the association rule mining problem can be formally stated as follows[***]<1>: let i = {i1, i2 influence:3 type:3 pair index:12 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:550 citee title:fast algorithms for mining association rules citee abstract:we consider the problem of discovering association rules between items in a large database of sales transactions. we present two new algorithms for solving this problem that are fundamentally different from the known algorithms. empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. we also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called apriorihybrid. scale-up experiments show that apriorihybrid scales linearly with the number of transactions. apriorihybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database surrounding text:this is done by generating a superset of possible candidate itemsets and pruning this set. [***]<1> discusses the function in detail. given the counts and frequent itemsets, we can compute all association rules with support >= minsup influence:3 type:3 pair index:13 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data.
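the surrounding text above refers to generating candidate itemsets, pruning them, and keeping the itemsets with support >= minsup. a minimal, non-private sketch of that apriori-style counting is shown below; the toy transactions and the absolute threshold are made up, and the candidate generation is the naive union-based variant rather than the optimized join/prune step of the cited algorithms.

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c", "e"}]  # toy data
minsup = 2                                       # absolute support threshold

def support(itemset):
    # number of transactions containing every item of the itemset
    return sum(1 for t in transactions if itemset <= t)

items = sorted({i for t in transactions for i in t})
frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
k = 2
while frequent:
    print(sorted(tuple(sorted(s)) for s in frequent))
    # candidate generation: unions of frequent (k-1)-itemsets, then prune by support
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = {c for c in candidates if support(c) >= minsup}
    k += 1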
we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:819 citee title:privacy-preserving data mining citee abstract:a fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. specifically, we address the following question. since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? we consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have surrounding text:however, none of this work addresses privacy concerns. there has been research considering how much information can be inferred, calculated or revealed from the data made available through data mining algorithms, and how to minimize the leakage of information [15, ***]<2>. however, this has been restricted to classification, and the problem has been treated with an "all or nothing" approach. corporations may not require absolute zero knowledge protocols (that leak no information at all) as long as they can keep the information shared within bounds. in [***]<2>, data perturbation techniques are used to protect individual privacy for classification, by adding random values from a normal/gaussian distribution of mean 0 to the actual data values. one problem with this approach is a tradeoff between privacy and the accuracy of the results [1]<2> influence:3 type:2 pair index:14 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:224 citee title:an extensible meta-learning approach for scalable and accurate inductive learning citee abstract:much of the research in inductive learning concentrates on problems with relatively small amounts of data. with the coming age of ubiquitous network computing, it is likely that orders of magnitude more data in databases will be available for various learning problems of real world importance. some learning algorithms assume that the entire data set fits into main memory, which is not feasible for massive amounts of data, especially for applications in data mining. one approach to handling a large data set is to partition the data set into subsets, run the learning algorithm on each of the subsets, and combine the results.
moreover, data can be inherently distributed across multiple sites on the network and merging all the data in one location can be expensive or prohibitive. in this thesis we propose, investigate, and evaluate a meta-learning approach to integrating the results of multiple learning processes. our approach utilizes machine learning to guide the integration. we identified two main meta-learning strategies: combiner and arbiter. both strategies are independent of the learning algorithms used in generating the classifiers. the combiner strategy attempts to reveal relationships among the learned classifiers' prediction patterns. the arbiter strategy tries to determine the correct prediction when the classifiers have different opinions. various schemes under these two strategies have been developed. empirical results show that our schemes can obtain accurate classifiers from inaccurate classifiers trained from data subsets. we also implemented and analyzed the schemes in a parallel and distributed environment to demonstrate their scalability surrounding text:background and related work the centralized data mining model assumes that all the data required by any data mining algorithm is either available at or can be sent to a central site. a simple approach to data mining over multiple sources that will not share data is to run existing data mining tools at each site independently and combine the results [***, 6, 17]<2>. however, this will often fail to give globally valid results influence:2 type:3 pair index:15 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:775 citee title:on the accuracy of meta-learning for scalable data mining citee abstract:in this paper, we describe a general approach to scaling data mining applications that we have come to call meta-learning. meta-learning refers to a general strategy that seeks to learn how to combine a number of separate learning processes in an intelligent fashion. we desire a meta-learning architecture that exhibits two key behaviors. first, the meta-learning strategy must produce an accurate final classification system. this means that a meta-learning architecture must produce a final outcome that is at least as accurate as a conventional learning algorithm applied to all available data. second, it must be fast, relative to an individual sequential learning algorithm when applied to massive databases of examples, and operate in a reasonable amount of time. this paper focussed primarily on issues related to the accuracy and efficacy of meta-learning as a general strategy. a number of empirical results are presented demonstrating that meta-learning is technically feasible in wide-area, network computing environments.
surrounding text:background and related work the centralized data mining model assumes that all the data required by any data mining algorithm is either available at or can be sent to a central site. a simple approach to data mining over multiple sources that will not share data is to run existing data mining tools at each site independently and combine the results [5, ***, 17]<2>. however, this will often fail to give globally valid results. distributed classification has also been addressed. a meta-learning approach has been developed that uses classifiers trained at different sites to develop a global classifier [***, 17]<2>. this could protect the individual entities, but it remains to be shown that the individual classifiers do not disclose private information influence:2 type:3 pair index:16 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:443 citee title:distributed web mining using bayesian networks from multiple data streams citee abstract:we present a collective approach to mine bayesian networks from distributed heterogeneous web-log data streams. in this approach we first learn a local bayesian network at each site using the local data. then each site identifies the observations that are most likely to be evidence of coupling between local and non-local variables and transmits a subset of these observations to a central site. another bayesian network is learnt at the central site using the data transmitted from the local site. the local and central bayesian networks are combined to obtain a collective bayesian network that models the entire data. we applied this technique to mine multiple data streams where data centralization is difficult because of large response time and scalability issues. experimental results and theoretical justification that demonstrate the feasibility of our approach are presented. surrounding text:this could protect the individual entities, but it remains to be shown that the individual classifiers do not disclose private information. recent work has addressed classification using bayesian networks in vertically partitioned data [***]<2>, and situations where the distribution is itself interesting with respect to what is learned [19]<2>. however, none of this work addresses privacy concerns influence:2 type:3 pair index:17 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources.
however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:479 citee title:efficient mining of association rules in distributed databases citee abstract:many sequential algorithms have been proposed for mining of association rules. however, very little work has been done in mining association rules in distributed databases. a direct application of sequential algorithms to distributed databases is not effective, because it requires a large amount of communication overhead. in this study, an efficient algorithm, dma, is proposed. it generates a small number of candidate sets and requires only o(n) messages for support count exchange for each surrounding text:algorithms have been proposed for distributed data mining. cheung et al. proposed a method for horizontally partitioned data [***]<2>, and more recent work has addressed privacy in this model [14]<2>. distributed classification has also been addressed influence:3 type:2 pair index:18 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:820 citee title:secure multi-party computation problems and their applications: a review and open problems citee abstract:the growth of the internet has triggered tremendous opportunities for cooperative computation, where people are jointly conducting computation tasks based on the private inputs they each supply. these computations could occur between mutually untrusted parties, or even between competitors. for example, customers might send to a remote database queries that contain private information; two competing financial organizations might jointly invest in a project that must satisfy both organizations' surrounding text:this is inefficient for large inputs, as in data mining. in [***]<2>, relationships have been drawn between several problems in data mining and secure multiparty computation. although this shows that secure solutions exist, achieving efficient secure solutions for privacy preserving distributed data mining is still open influence:2 type:3 pair index:19 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources.
each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:821 citee title:secure multi-party computational geometry citee abstract:the general secure multi-party computation problem is when multiple parties (say, alice and bob) each have private data (respectively, a and b) and seek to compute some function f(a, b) without revealing to each other anything unintended (i.e., anything other than what can be inferred from knowing f(a, b)). it is well known that, in theory, the general secure multi-party computation problem is solvable using circuit evaluation protocols. while this approach is appealing in its surrounding text:the component algorithm secure computation of scalar product is the key to our protocol. scalar product protocols have been proposed in the secure multiparty computation literature [***]<2>; however, these cryptographic solutions do not scale well to this data mining problem. we give an algebraic solution that hides true values by placing them in equations masked with random values. when we mine boolean association rules, the input values (xi and yi values) are restricted to 0 or 1. this creates a disclosure risk, both with our protocol and with other scalar product protocols [***, 13]<2>. recall that a provides n + r equations in 2n unknowns influence:3 type:2 pair index:20 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:161 citee title:a secure protocol for computing dot products in clustered and distributed environments citee abstract:dot-products form the basis of various applications ranging from scientific computations to commercial applications in data mining and transaction processing. typical scientific computations utilizing sparse iterative solvers use repeated matrix-vector products. these can be viewed as dot-products of sparse vectors. in database applications, dot-products take the form of counting operations. with widespread use of clustered and distributed platforms, these operations are increasingly being performed across networked hosts. traditional apis for messaging are susceptible to sniffing, and the data being transferred between hosts is often enough to compromise the entire computation.
for example, in a domain decomposition based sparse solver, the entire solution can often be reconstructed easily from boundary values that are communicated on the net. in yet other applications, dot-products may be performed across two hosts that do not want to disclose their vectors, yet, they need to compute the dot-product. in each of these cases, there is a need for secure and anonymous dot-product protocols. due to the large computational requirements of underlying applications, it is highly desirable that secure protocols add minimal overhead to the original algorithm. finally, by its very nature, dot-products leak limited amounts of information - one of the parties can detect an entry of the other party's vector by simply probing it with a vector with a 1 in a particular location and zeros elsewhere. given all of these constraints, traditional cryptographic protocols are generally unsuitable due to their significant computational and communication overheads. in this paper, we present an extremely efficient and sufficiently secure protocol for computing the dot-product of two vectors using linear algebraic techniques. using analytical as well as experimental results, we demonstrate superior performance in terms of computational overhead, numerical stability, and security. we show that the overhead of a two-party dot-product computation using mpi as the messaging api across two high-end workstations connected via a gigabit ethernet approaches a multiple of 4.69 over an unsecured dot-product. we also show that the average relative error in dot-products across a large number of random (normalized) vectors was roughly 4.5 x 10^-9. surrounding text:the knowledge disclosed by these equations only allows computation of private values if one side learns a substantial number of the private values from an outside source. (a different algebraic technique has recently been proposed [***]<2>; however, it requires at least twice the bitwise communication cost of the method presented here.) we assume without loss of generality that n is even. when we mine boolean association rules, the input values (xi and yi values) are restricted to 0 or 1. this creates a disclosure risk, both with our protocol and with other scalar product protocols [10, ***]<2>. recall that a provides n + r equations in 2n unknowns influence:2 type:2 pair index:21 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 : database applications data mining; h.2.4 : systems distributed databases; h.2.7 : database administration - security, integrity, and protection citee id:822 citee title:privacy-preserving distributed mining of association rules on horizontally partitioned data citee abstract:data mining can extract important knowledge from large data collections, but sometimes these collections are split among various parties.
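the surrounding text above describes hiding true values by placing them in equations masked with random values. the toy sketch below illustrates the general "mask, exchange, cancel the masks" pattern using the well-known commodity-server (semi-trusted third party) variant of a secure scalar product; it is not the citing paper's algebraic two-party protocol (which builds n + r masked equations), and the vector length and random ranges are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
n = 8
x = rng.integers(0, 2, n)          # A's private boolean column
y = rng.integers(0, 2, n)          # B's private boolean column

# setup by a semi-trusted third party: random Ra, Rb and shares with ra + rb = Ra . Rb
Ra = rng.integers(-50, 50, n)
Rb = rng.integers(-50, 50, n)
ra = int(rng.integers(-1000, 1000))
rb = int(Ra @ Rb) - ra

x_masked = x + Ra                  # A -> B: x hidden behind the random mask Ra
y_masked = y + Rb                  # B -> A: y hidden behind the random mask Rb
t = int(x_masked @ y) + rb         # B -> A: combines its private y with the masked x

result = t - int(Ra @ y_masked) + ra   # the masks cancel: result = x . y
assert result == int(x @ y)            # A learns the scalar product, not y itself
print(result)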
privacy concerns may prevent the parties from directly sharing the data, and some types of information about the data. this paper addresses secure mining of association rules over horizontally partitioned data. the methods incorporate cryptographic techniques to minimize the information shared, while adding little overhead to the mining task. surrounding text:algorithms have been proposed for distributed data mining. cheung et al. proposed a method for horizontally partitioned data [8]<2>, and more recent work has addressed privacy in this model [***]<2>. distributed classification has also been addressed influence:1 type:2 pair index:22 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values. categories and subject descriptors h.2.8 [database applications]: data mining; h.2.4 [systems]: distributed databases; h.2.7 [database administration]: security, integrity, and protection citee id:823 citee title:privacy-preserving association rule mining citee abstract:the current trend in the application space towards systems of loosely coupled and dynamically bound components that enables just-in-time integration jeopardizes the security of information that is shared between the broker, the requester, and the provider at runtime. in particular, new advances in data mining and knowledge discovery, which allow for the extraction of hidden knowledge in enormous amounts of data, impose new threats on the seamless integration of information. in this paper, we consider the problem of building privacy preserving algorithms for one category of data mining techniques, association rule mining. we introduce new metrics in order to demonstrate how security issues can be taken into consideration in the general framework of association rule mining, and we show that the complexity of the new heuristics is similar to that of the original algorithms. surrounding text:one problem with this approach is a tradeoff between privacy and the accuracy of the results [1]<2>. more recently, data perturbation has been applied to boolean association rules [***]<2>. one interesting feature of this work is a flexible definition of privacy influence:1 type:2 pair index:23 citer id:17 citer title:privacy preserving association rule mining in vertically partitioned data citer abstract:privacy considerations often constrain data mining projects. this paper addresses the problem of association rule mining where transactions are distributed across sources. each site holds some attributes of each transaction, and the sites wish to collaborate to identify globally valid association rules. however, the sites must not reveal individual transaction data. we present a two-party algorithm for efficiently discovering frequent itemsets with minimum support levels, without either site revealing individual transaction values.
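note: for the horizontally partitioned setting mentioned above, a common building block is a secure sum of local support counts. the sketch below shows only that pattern; it is not the cited paper's protocol, and the site counts, modulus, and minimum-support usage are made-up examples.

```python
# Illustrative secure-sum round for combining local itemset support counts
# from horizontally partitioned sites (a sketch of the general idea only;
# the sites, counts, and modulus are hypothetical).
import random

MODULUS = 2**32  # all arithmetic is done modulo a public constant

def secure_sum(local_counts):
    """Each site adds its count to a running total masked by a random offset.

    The initiating site starts from a random mask, so no site ever sees a
    partial sum of real counts; the initiator removes the mask at the end.
    """
    mask = random.randrange(MODULUS)
    running = mask
    for count in local_counts:           # value passed from site to site
        running = (running + count) % MODULUS
    return (running - mask) % MODULUS    # initiator removes its mask

if __name__ == "__main__":
    local_supports = [120, 45, 310]      # hypothetical supports of {A, B} at three sites
    total_rows = [1000, 400, 2600]
    support = secure_sum(local_supports) / sum(total_rows)
    print(f"global support = {support:.3f}")  # compare against the minimum support
```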
categories and subject descriptors h.2.8 [database applications]: data mining; h.2.4 [systems]: distributed databases; h.2.7 [database administration]: security, integrity, and protection citee id:824 citee title:when distribution is part of the semantics: a new problem class for distributed knowledge discovery citee abstract:within a research project at daimlerchrysler we use vehicles as mobile data sources for distributed knowledge discovery. we realized that current approaches are not suitable for our purposes. they aim to infer a global model and try to approximate the results one would get from a single joined data source. thus, they treat distribution as a technical issue only and ignore that the distribution itself may have a meaning and that models depend on the context in which they were derived. the main contribution of this paper is the identification of a practically relevant new problem class for distributed knowledge discovery which addresses the semantics of distribution. we show that this problem class is the proper framework for many important applications in which it should become an integral part of the knowledge discovery process, affecting the results as well as the process itself. we outline a novel solution, called knowledge discovery from models, which uses models as primary input and combines content driven and context driven analyses. finally, we discuss challenging research questions, which are raised by the new problem class. surrounding text:this could protect the individual entities, but it remains to be shown that the individual classifiers do not disclose private information. recent work has addressed classification using bayesian networks in vertically partitioned data [7]<2>, and situations where the distribution is itself interesting with respect to what is learned [***]<2>. however, none of this work addresses privacy concerns influence:3 type:3 pair index:24 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints.
general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:819 citee title:privacy-preserving data mining citee abstract:a fruitful direction for future data mining research will be the development of techniques that incorporate privacy concerns. specifically, we address the following question. since the primary task in data mining is the development of models about aggregated data, can we develop accurate models without access to precise information in individual data records? we consider the concrete case of building a decision-tree classifier from training data in which the values of individual records have surrounding text:the tradeoff between information loss and the re-identification risk using such perturbative methods is being actively researched [4, 21]<3>. data masked using only additive noise was used to generate classification models which were evaluated using a synthetic benchmark in [***]<3>. for predictive modeling applications, further work is needed to quantify and evaluate the tradeoffs between model accuracy and the probabilistic disclosure risk on real data sets influence:3 type:3 pair index:25 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints.
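note: the additive-noise masking mentioned in the surrounding text can be sketched in a few lines; the noise scale and column below are invented for illustration and do not reproduce the cited benchmark or reconstruction procedure.

```python
# Minimal sketch of additive-noise masking for a numeric attribute
# (illustrative only; the column and noise scale are made up).
import random

def mask_with_noise(values, noise_std):
    """Return the values perturbed with zero-mean gaussian noise."""
    return [v + random.gauss(0.0, noise_std) for v in values]

if __name__ == "__main__":
    ages = [23, 37, 41, 58, 62, 29]           # original sensitive column
    masked_ages = mask_with_noise(ages, 5.0)  # version released for model building
    # aggregate statistics stay roughly intact while individual values are blurred
    print(sum(ages) / len(ages), sum(masked_ages) / len(masked_ages))
```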
general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:338 citee title:comparing sdc methods for microdata on the basis of information loss and disclosure risk citee abstract:we present in this paper the first empirical comparison of sdc methods for microdata which encompasses both continuous and categorical microdata. based on re-identification experiments, we try to optimize the tradeoff between information loss and disclosure risk. first, relevant sdc methods for continuous and categorical microdata are identified. then generic information loss measures (not targeted to specific data uses) are defined, both in the continuous and the categorical case. surrounding text:addition of noise and selective data swapping are used in [11]<1> to generate masked data with small disclosure risk while preserving means and correlations between attributes, even in many sub-domains. the tradeoff between information loss and the re-identification risk using such perturbative methods is being actively researched [***, 21]<3>. data masked using only additive noise was used to generate classification models which were evaluated using a synthetic benchmark in [1]<3> influence:2 type:2 pair index:26 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:851 citee title:supervised and unsupervised discretization of continuous features citee abstract:many supervised machine learning algorithms require a discrete feature space. in this paper, we review previous work on continuous feature discretization, identify defining characteristics of the methods, and conduct an empirical evaluation of several methods. we compare binning, an unsupervised discretization method, to entropy-based and purity-based methods, which are supervised algorithms. we found that the performance of the naive-bayes algorithm significantly improved when features were discretized using an entropy-based method.
in fact, over the 16 tested datasets, the discretized version of naive-bayes slightly outperformed c4.5 on average. we also show that in some cases, the performance of the c4.5 induction algorithm significantly improved if features were discretized in advance; in our experiments, the performance never significantly degraded, an interesting phenomenon considering the fact that c4.5 is capable of locally discretizing features. surrounding text:g. , [***, 9]<3>). the bit string for a numeric column is made up of one bit for each potential end point in value order influence:3 type:3 pair index:27 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:583 citee title:genetic algorithms in search, optimization and machine learning citee abstract:this book brings together - in an informal and tutorial fashion - the computer techniques, mathematical tools, and research results that will enable both students and practitioners to apply genetic algorithms to problems in many fields. major concepts are illustrated with running examples, and major algorithms are illustrated by pascal computer programs. no prior knowledge of gas or genetics is assumed, and only a minimum of computer programming and mathematics background is required. surrounding text:this factor, along with the variety in the metrics to be optimized, motivated the use of a general framework for optimization. the genetic algorithm framework was chosen because of its flexible formulation and its ability to find good solutions given adequate computational resources [8, ***]<1>. clearly, this randomized approach does not guarantee finding a globally optimal solution influence:2 type:3 pair index:28 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number).
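note: the surrounding text above describes encoding a numeric column as a bit string with one bit per potential end point in value order. the sketch below shows one plausible way such a chromosome could be decoded into generalization intervals; the end points, chromosome, and helper names are hypothetical and are not taken from the citing paper.

```python
# Sketch: decode a bit string over candidate end points into generalization
# intervals for a numeric column (one plausible reading of the encoding
# described above; names and values are hypothetical).

def decode_intervals(bits, end_points):
    """bits[i] == 1 keeps end_points[i] as an interval boundary."""
    kept = [p for b, p in zip(bits, end_points) if b]
    bounds = [float("-inf")] + kept + [float("inf")]
    return list(zip(bounds[:-1], bounds[1:]))

def generalize(value, intervals):
    """Replace a raw value with the interval that contains it."""
    for lo, hi in intervals:
        if lo < value <= hi:
            return (lo, hi)
    return intervals[-1]

if __name__ == "__main__":
    end_points = [20, 30, 40, 50, 60]   # candidate cut points, in value order
    chromosome = [0, 1, 0, 1, 0]        # the GA decides which cuts survive
    intervals = decode_intervals(chromosome, end_points)
    print(intervals)                    # [(-inf, 30), (30, 50), (50, inf)]
    print(generalize(37, intervals))    # age 37 -> (30, 50)
```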
data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:865 citee title:use of contextual information for feature ranking and discretization citee abstract:deriving classification rules or decision trees from examples is an important problem. when there are too many features, discarding weak features before the derivation process is highly desirable. when there are numeric features, they need to be discretized for the rule generation. we present a new approach to these problems. traditional techniques make use of feature merits based on either the information-theoretic or the statistical correlation between each feature and the class. we instead assign merits to features by finding each feature's "obligation" to the class discrimination in the context of other features. the merits are then used to rank the features, select a feature subset, and discretize the numeric variables. experience with benchmark example sets demonstrates that the new approach is a powerful alternative to the traditional methods. this paper concludes by posing some new technical issues that arise from this approach. surrounding text:g. , [5, ***]<3>). the bit string for a numeric column is made up of one bit for each potential end point in value order influence:3 type:3 pair index:29 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number).
this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:711 citee title:masking microdata files citee abstract:government agencies collect many types of data, but due to confidentiality restrictions, use of the microdata is often limited to sworn agents working on secure computer systems at those agencies. these restrictions can severely affect public policy decisions made at one agency that has access to nonconfidential summary statistics only. this necessitates creation of microdata which not only meets the confidentiality requirements but also has sufficient utility. this paper describes a general methodology for producing public-use data files that preserves confidentiality and allows many analytical uses. the methodology masks quantitative data using an additive-noise approach and then, when necessary, employs a reidentification/swapping methodology to assure confidentiality. one of the major advantages of this masking scheme is that it also allows obtaining precise subpopulation estimates, which is not possible with other known masking schemes. in addition, if controlled distortion is applied, then a prespecified subset of subpopulation estimates from the masked file could be nearly identical to those from the unmasked file. this paper provides the theoretical underpinning of the masking methodology and the results of its actual application using examples. surrounding text:this is being fueled by progress in various technologies like storage, networking and automation in various business processes. of particular interest are data containing structured information on individuals (referred to as micro-data in [***, 10, 14]<3>). influence:2 type:3 pair index:30 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions.
first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:719 citee title:measures of disclosure risk and harm citee abstract:disclosure is a difficult topic. even the definition of disclosure depends on the context. sometimes it is enough to violate anonymity. sometimes sensitive information has to be revealed. sometimes a disclosure is said to occur even though the information revealed is incorrect. this paper tries to untangle disclosure issues by differentiating between linking a respondent to a record and learning sensitive information from the linking. the extent to which a released record can be linked to a respondent determines disclosure risk; the information revealed when a respondent is linked to a released record determines disclosure harm. there can be harm even if the wrong record is identified or an incorrect sensitive value inferred. in this paper, measures of disclosure risk and harm that reflect what is learned about a respondent are studied, and some implications for data release policies are given. surrounding text:an example in [14]<1> illustrates the identification by linking a medical data set and a voter list using fields like zip code, date of birth and gender. in addition to the identity disclosure problem discussed above, attribute disclosure occurs when something about an individual is learnt from the released data [***]<1>. attribute disclosure can happen even without identity disclosure influence:2 type:3 pair index:31 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data.
lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:831 citee title:protecting respondents' identities in microdata release citee abstract:today's globally networked society places great demand on the dissemination and sharing of information. while in the past released information was mostly in tabular and statistical form, many situations call today for the release of specific data (microdata). in order to protect the anonymity of the entities (called respondents) to which information refers, data holders often remove or encrypt explicit identifiers such as names, addresses, and phone numbers. de-identifying data, however, provides no guarantee of anonymity. released information often contains other data, such as race, birth date, sex, and zip code, that can be linked to publicly available information to re-identify respondents and infer information that was not intended for disclosure. in this paper we address the problem of releasing microdata while safeguarding the anonymity of the respondents to which the data refer. the approach is based on the definition of k-anonymity. a table provides k-anonymity if attempts to link explicitly identifying information to its content map the information to at least k entities. we illustrate how k-anonymity can be provided without compromising the integrity (or truthfulness) of the information released by using generalization and suppression techniques. we introduce the concept of minimal generalization that captures the property of the release process not to distort the data more than needed to achieve k-anonymity, and present an algorithm for the computation of such a generalization. we also discuss possible preference policies to choose among different minimal generalizations. surrounding text:this is being fueled by progress in various technologies like storage, networking and automation in various business processes. of particular interest are data containing structured information on individuals (referred to as micro-data in [11, 10, ***]<3>). influence:2 type:3 pair index:32 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions.
first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:830 citee title:protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression citee abstract:today's globally networked society places great demand on the dissemination and sharing of person-specific data. situations where aggregate statistical information was once the reporting norm now rely heavily on the transfer of microscopically detailed transaction and encounter information. this happens at a time when more and more historically public information is also electronically available. when these data are linked together, they provide an electronic shadow of a person or organization surrounding text:s. a, however, it has been pointed out that this is not sufficient since the released data contains other information which when linked with other data sets can identify or narrow down the individuals or entities [10, ***, 17, 14]<1>. an example in [14]<1> illustrates the identification by linking a medical data set and a voter list using fields like zip code, date of birth and gender. suppression could be viewed as the ultimate generalization since no information is released. as before, the data transformation challenge is to find the right tradeoff between the amount of privacy and loss of information content due to generalizations and suppressions [10, ***, 17, 14]<1>. this paper uses the approach of transforming the data using generalizations and suppression to satisfy the privacy constraints. a more general notion that allows specific entries (cells) to be suppressed has also been proposed [10]<1>. we extend the earlier works [10, ***, 17, 14]<1> by allowing more flexible generalizations as described next. the form of the allowed generalization depends on the type of data in a potentially identifying column. we also include in this case the situation when the usage is unknown at the time of dissemination. for this case we define a metric which captures some general notion of information loss as was done in the earlier works [***, 17, 14]<1>. our metric differs from the earlier ones because it has to handle the more flexible generalizations allowed in our work influence:1 type:1 pair index:33 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons.
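note: the k-anonymity definition quoted above (attempts to link identifying information map to at least k entities) can be checked directly on a generalized table. the sketch below is only that check, not the cited enforcement algorithm; the quasi-identifier columns and rows are made up.

```python
# Minimal k-anonymity check over a generalized table (illustrates the
# definition quoted above; the table and quasi-identifiers are hypothetical).
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier value combination occurs in >= k rows."""
    groups = Counter(tuple(row[a] for a in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

if __name__ == "__main__":
    released = [
        {"age": "(30, 50]", "zip": "021**", "diagnosis": "flu"},
        {"age": "(30, 50]", "zip": "021**", "diagnosis": "asthma"},
        {"age": "(50, 70]", "zip": "021**", "diagnosis": "flu"},
        {"age": "(50, 70]", "zip": "021**", "diagnosis": "diabetes"},
    ]
    print(is_k_anonymous(released, ["age", "zip"], k=2))  # True
```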
this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:773 citee title:on identification disclosure and prediction disclosure for microdata citee abstract:two definitions of statistical disclosure - identification disclosure and prediction disclosure - are compared. identification disclosure implies prediction disclosure but not vice versa. it is argued, however, that if sampling takes place then cases where prediction disclosure occurs and identification disclosure does not either have very small probability or do not present disclosure problems different from those normally met in the release of aggregate statistics. finally the estimation of population uniqueness using the poisson-gamma model is considered. surrounding text:g. , [6, ***, 3]<3>). applying the k-anonymity approach to the release of a sample opens up some new issues influence:3 type:3 pair index:34 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints.
general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:857 citee title:the genitor algorithm and selective pressure: why rank-based allocation of reproductive trials is best citee abstract:this paper reports work done over the past three years using rank-based allocation of reproductive trials. new evidence and arguments are presented which suggest that allocating reproductive trials according to rank is superior to fitness proportionate reproduction. ranking can not only be used to slow search speed, but also to increase search speed when appropriate. furthermore, the use of ranking provides a degree of control over selective pressure that is not possible with fitness surrounding text:two of the key operations in the iterative process are crossover (combine portions of two solutions to produce two other solutions) and mutation (incrementally modify a solution). the specific form of the genetic algorithm used in our application is based on the genitor work [***]<1>. extensions to this work that were necessary for our application will be described later. considering our example, if bit b3 is 1, then this implies that bits b2, b4, b5, and b6 are also 1. the genetic algorithm [***]<1> has to be extended to ensure that the chromosomes in the population represent valid generalizations for the categorical columns. this is done by an additional step that modifies newly generated chromosomes that are invalid into valid ones while retaining as much as possible of the original characteristics. this is done by an additional step that modifies newly generated chromosomes that are invalid into valid ones while retaining as much as possible of the original characteristics. the genetic algorithm used [***]<1> requires choosing some parameters like the size of the population, the probability of mutating a bit in the chromosome and the number of iterations to be run. these parameters are typically chosen based on some experimentation influence:3 type:3 pair index:35 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework.
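note: the surrounding text above describes a repair step that turns invalid chromosomes into valid generalizations (e.g., setting bit b3 forces bits b2, b4, b5, and b6 to be 1). the sketch below shows one generic way such implication constraints could be enforced; the implication map and the 0-based indexing are illustrative assumptions, not the citing paper's actual hierarchy rules.

```python
# Sketch of a chromosome repair step: whenever a set bit implies other bits,
# force the implied bits to 1 so the bit string encodes a valid generalization.
# Bits are addressed by 0-based index here; the implication map is made up.

def repair(chromosome, implications):
    """Return a valid copy of the chromosome, changing as few bits as possible."""
    repaired = list(chromosome)
    changed = True
    while changed:                      # iterate until a fixed point is reached
        changed = False
        for bit, implied in implications.items():
            if repaired[bit] == 1:
                for j in implied:
                    if repaired[j] == 0:
                        repaired[j] = 1
                        changed = True
    return repaired

if __name__ == "__main__":
    implications = {3: [2, 4, 5, 6]}      # hypothetical rule from the example above
    invalid = [0, 0, 0, 1, 0, 1, 0]       # bit 3 is set, but bits 2, 4, 6 are not
    print(repair(invalid, implications))  # [0, 0, 1, 1, 1, 1, 1]
```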
these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:848 citee title:statistical disclosure control in practice citee abstract:the aim of this book is to discuss various aspects associated with disseminating personal or business data collected in censuses or surveys or copied from administrative sources. the problem is to present the data in such a form that they are useful for statistical research and to provide sufficient protection for the individuals or businesses to whom the data refer. the major part of this book is concerned with how to define the disclosure problem and how to deal with it in practical circumstances surrounding text:the dissemination could be to satisfy some legal requirements or as part of some business process. an important issue that has to be addressed is the protection of the privacy of individuals or entities referred to in the released micro-data [***, 20, 6]<1>. an obvious step to protect the privacy of the individuals (or entities) is to replace any explicitly identifying information by some randomized placeholder. g. , physical or mental health of an individual) [***]<1>. one approach to handling sensitive attributes is to exclude them from public use data sets [***]<3> influence:2 type:3 pair index:36 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions. first, satisfying privacy constraints is considered in conjunction with the usage for the data being disseminated. this allows us to optimize the process of preserving privacy for the specified usage. in particular, we investigate the privacy transformation in the context of data mining applications like building classification and regression models. second, our work improves on previous approaches by allowing more flexible generalizations for the data. lastly, this is combined with a more thorough exploration of the solution space using the genetic algorithm framework. these extensions allow us to transform the data so that they are more useful for their intended purpose while satisfying the privacy constraints. general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:505 citee title:elements of statistical disclosure control citee abstract:statistical disclosure control is the discipline that deals with producing statistical data that are safe enough to be released to external researchers. this book concentrates on the methodology of the area. it deals with both microdata (individual data) and tabular (aggregated) data.
the book attempts to develop the theory from what can be called the paradigm of statistical confidentiality: to modify unsafe data in such a way that safe (enough) data emerge, with minimum information loss. this book discusses what safe data are, how information loss can be measured, and how to modify the data in a (near) optimal way. once it has been decided how to measure safety and information loss, the production of safe data from unsafe data is often a matter of solving an optimization problem. several such problems are discussed in the book, and most of them turn out to be hard problems that can be solved only approximately. the authors present new results that have not been published before. the book is not a description of an area that is closed, but, on the contrary, one that still has many spots waiting to be more fully explored. some of these are indicated in the book. the book will be useful for official, social and medical statisticians and others who are involved in releasing personal or business data for statistical use. operations researchers may be interested in the optimization problems involved, particularly for the challenges they present. leon willenborg has worked at the department of statistical methods at statistics netherlands since 1983, first as a researcher and since 1989 as a senior researcher. since 1989 his main field of research and consultancy has been statistical disclosure control. from 1996-1998 he was the project coordinator of the eu co-funded sdc project surrounding text:the dissemination could be to satisfy some legal requirements or as part of some business process. an important issue that has to be addressed is the protection of the privacy of individuals or entities referred to in the released micro-data [19, ***, 6]<1>. an obvious step to protect the privacy of the individuals (or entities) is to replace any explicitly identifying information by some randomized placeholder influence:2 type:3 pair index:37 citer id:22 citer title:transforming data to satisfy privacy constraints citer abstract:data on individuals and entities are being collected widely. these data can contain information that explicitly identifies the individual (e.g., social security number). data can also contain other kinds of personal information (e.g., date of birth, zip code, gender) that are potentially identifying when linked with other available data sets. data are often shared for business or legal reasons. this paper addresses the important issue of preserving the anonymity of the individuals or entities during the data dissemination process. we explore preserving the anonymity by the use of generalizations and suppressions on the potentially identifying portions of the data. we extend earlier works in this area along various dimensions.
general terms privacy, data transformation, generalization, suppression, predictive modeling citee id:418 citee title:disclosure risk assessment in perturbative microdata protection citee abstract:this paper describes methods for data perturbation that include rank swapping and additive noise. it also describes enhanced methods of re-identification using probabilistic record linkage. the empirical comparisons use variants of the framework for measuring information loss and re-identification risk that were introduced by domingo-ferrer and mateo-sanz. surrounding text:addition of noise and selective data swapping are used in [11]<1> to generate masked data with small disclosure risk while preserving means and correlations between attributes, even in many sub-domains. the tradeoff between information loss and the re-identification risk using such perturbative methods is being actively researched [4, ***]<3>. data masked using only additive noise was used to generate classification models which were evaluated using a synthetic benchmark in [1]<3> influence:2 type:2 pair index:38 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:36 citee title:pruned query evaluation using precomputed impacts citee abstract:exhaustive evaluation of ranked queries can be expensive, particularly when only a small subset of the overall ranking is required, or when queries contain common terms. this concern gives rise to techniques for dynamic query pruning, that is, methods for eliminating redundant parts of the usual exhaustive evaluation, yet still generating a demonstrably "good enough" set of answers to the query. in this work we propose new pruning methods that make use of impact-sorted indexes. compared to exhaustive evaluation, the new methods reduce the amount of computation performed, reduce the amount of memory required for accumulators, reduce the amount of data transferred from disk, and at the same time allow performance guarantees in terms of precision and mean average precision. these strong claims are backed by experiments using the trec terabyte collection and queries surrounding text:the difference is that they use the surrogates for query expansion, while we use them to build the index, and that our term selection method (kl divergence) is different from theirs. anh and moffat [***]<2> have recently presented a very effective pruning technique that is non-static, but requires the index to be in impact-order instead of the traditional document order.
their technique allows the amount of pruning to be changed dynamically, after the index has been created, and represents an interesting alternative to our pruning method influence:1 type:2 pair index:39 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:37 citee title:an efficient computation of the multiple-bernoulli language model citee abstract:the multiple bernoulli (mb) language model has been generally considered too computationally expensive for practical purposes and superseded by the more efficient multinomial approach. while the model has many attractive properties, little is actually known about the retrieval effectiveness of the mb model due to its high cost of execution. in this paper, we show how an efficient implementation of this model can be achieved. the resulting method is comparable in terms of efficiency to other standard term matching algorithms (such as the vector space model, bm25 and the multinomial language model). surrounding text:[9]<2> present a possible approach). our method is not suitable for phrase queries and other query forms that require the existence of positional information in the index; however, it can be used to perform bag-of-words search operations, in conjunction with scoring functions such as okapi bm25 [13]<2> and certain implementations of language-model-based retrieval functions [***]<2>. if a search query requires access to positional index information, a separate index, containing this information, can be used to process the query influence:3 type:3 pair index:40 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents.
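note: the surrounding text above mentions using the pruned index for bag-of-words retrieval with scoring functions such as okapi bm25. the sketch below applies the standard bm25 formula (with common defaults k1 = 1.2, b = 0.75) over a tiny made-up in-memory index; it is not the cited system's implementation.

```python
# Sketch of Okapi BM25 bag-of-words scoring over a tiny in-memory index
# (standard formula with common defaults; the documents are made up).
import math
from collections import Counter

DOCS = {
    "d1": "static index pruning for text retrieval".split(),
    "d2": "kullback leibler divergence of language models".split(),
    "d3": "index pruning with impact sorted index lists".split(),
}

N = len(DOCS)
AVGDL = sum(len(tokens) for tokens in DOCS.values()) / N
TF = {doc: Counter(tokens) for doc, tokens in DOCS.items()}
DF = Counter(term for tokens in DOCS.values() for term in set(tokens))

def bm25(query, doc, k1=1.2, b=0.75):
    """BM25 score of a document for a bag-of-words query."""
    score = 0.0
    dl = len(DOCS[doc])
    for term in query:
        f = TF[doc][term]
        if f == 0:
            continue
        idf = math.log((N - DF[term] + 0.5) / (DF[term] + 0.5) + 1.0)
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / AVGDL))
    return score

if __name__ == "__main__":
    query = "index pruning".split()
    ranking = sorted(DOCS, key=lambda d: bm25(query, d), reverse=True)
    print(ranking)  # d1 and d3 rank above d2, which matches no query term
```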
this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:38 citee title:efficient phrase querying with an auxiliary index citee abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query surrounding text:bahle et al. [***]<2>, for instance, report, that only 8. 3% of the queries they found in an excite query log were phrase queries influence:3 type:3 pair index:41 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a documentcentric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval e ectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:39 citee title:techniques for efficient query expansion citee abstract:fi query expansion is a well-known method for improving av-erage e ectiveness in information retrievalfi however, the most e ective query expansion methods rely on costly retrieval and processing of feed-back documentsfi we explore alternative methods for reducing query-evaluation costs, and propose a new method based on keeping a brief summary of each document in memoryfi this method allows query expan-sion to proceed three times faster than previously, while approximating the e ectiveness of standard expansionfi surrounding text:in contrast to the term-centric pruning methods described above, we propose a document-centric strategy, where the decision whether a posting is taken into the pruned index does not depend on the posting's rank within its term's posting list, but on its rank within the document it refers to. in some sense, our technique is similar to the feedback mechanism based on document surrogates that was proposed by billerbeck and zobel [***]<2>. 
the difference is that they use the surrogates for query expansion, while we use them to build the index, and that our term selection method (kl divergence) is different from theirs. topics: trec terabyte 2005. finally, we compared our kld-based term selection function to the formula proposed by billerbeck and zobel [***]<2>. although billerbeck's objective (query expansion) is different from ours (index pruning), the techniques are similar, raising the question of how their method performs compared to ours influence:3 type:2 pair index:42 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:40 citee title:static index pruning for information retrieval systems citee abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index surrounding text:1 [content analysis and indexing]: indexing methods general terms experimentation, performance keywords information retrieval, index pruning, kl divergence 1. introduction fagin et al. [***]<1> introduced the concept of static index pruning to information retrieval. in their paper, they describe a term-centric pruning method that, for each term t in the index, only retains its top kt postings, according to the individual score impact that each posting would have if t appeared in an ad-hoc search query (kt may be term-specific, not necessarily constant) influence:1 type:2 pair index:43 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not.
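To make the term-centric pruning idea mentioned in the surrounding text above concrete, here is a minimal, hypothetical Python sketch (not code from any of the cited systems): each posting of a term is scored by the impact it would have in an ad-hoc query for that term, and only the top kt postings are kept. The in-memory index layout and the tf-idf-style impact function are assumptions made purely for the example.

    import math

    def term_centric_prune(index, doc_count, k=10):
        """index: dict mapping term -> list of (doc_id, tf); returns a pruned copy."""
        pruned = {}
        for term, postings in index.items():
            idf = math.log(1 + doc_count / len(postings))
            # rank postings by the score impact they would have in a query for this term
            by_impact = sorted(postings, key=lambda p: p[1] * idf, reverse=True)
            pruned[term] = sorted(by_impact[:k])   # keep top-k, restore document order
        return pruned

    toy = {"pruning": [(1, 3), (2, 1), (3, 7)], "the": [(1, 20), (2, 18), (3, 25)]}
    print(term_centric_prune(toy, doc_count=3, k=2))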
the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:41 citee title:an information-theoretic approach to automatic query expansion citee abstract:techniques for automatic query expansion from top retrieved documents have shown promise for improving retrieval effectiveness on large collections; however, they often rely on an empirical ground, and there is a shortage of cross-system comparisons. using ideas from information theory, we present a computationally simple and theoretically justified method for assigning scores to candidate expansion terms. such scores are used to select and weight expansion terms within rocchio's framework for query reweighting. we compare ranking with information-theoretic query expansion versus ranking with other query expansion techniques, showing that the former achieves better retrieval effectiveness on several performance measures. we also discuss the effect on retrieval effectiveness of the main parameters involved in automatic query expansion, such as data sparseness, query difficulty, number of selected documents, and number of selected terms, pointing out interesting relationships. surrounding text:conceptually, for every document d in the collection, we perform a pseudo-relevance feedback step, based on kullback-leibler divergence scores (described by carpineto et al. [***]<1>) at indexing time and only keep postings for the top kd feedback terms extracted from that document in the index, discarding everything else. because pseudo-relevance feedback techniques are very good at finding the set of query terms, given the top search results, this method can be used to very accurately predict the set of queries for which d can make it into the top documents influence:3 type:3 pair index:44 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents.
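As an illustration of the document-centric decision just described, the following is a hedged Python sketch under simplifying assumptions (maximum-likelihood language models, a crude fallback for unseen terms; function and variable names are invented for the example, not taken from the cited system): each term of a document is scored by its contribution to the KL divergence between the document's language model and the collection's language model, and only the top kd terms keep their postings.

    import math
    from collections import Counter

    def top_terms_by_kld(doc_tokens, collection_tf, collection_len, k):
        """Return the k terms of this document with the largest KLD contribution."""
        doc_tf = Counter(doc_tokens)
        doc_len = len(doc_tokens)

        def contribution(term):
            p_doc = doc_tf[term] / doc_len
            p_col = collection_tf.get(term, 1) / collection_len   # crude fallback for unseen terms
            return p_doc * math.log(p_doc / p_col)

        return sorted(doc_tf, key=contribution, reverse=True)[:k]

    # only postings for the surviving (document, term) pairs would be written to
    # the pruned index; all other postings of the document are discarded.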
this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:42 citee title:improving web search efficiency via a locality based static pruning method citee abstract:the unarguably fast, and continuous, growth of the volume of indexed (and indexable) documents on the web poses a great challenge for search engines. this is true regarding not only search effectiveness but also time and space efficiency. in this paper we present an index pruning technique targeted for search engines that addresses the latter issue without disconsidering the former. to this effect, we adopt a new pruning strategy capable of greatly reducing the size of search engine indices. experiments using a real search engine show that our technique can reduce the indices' storage costs by up to 60% over traditional lossless compression methods, while keeping the loss in retrieval precision to a minimum. when compared to the indices' size with no compression at all, the compression rate is higher than 88%, i.e., less than one eighth of the original size. more importantly, our results indicate that, due to the reduction in storage overhead, query processing time can be reduced to nearly 65% of the original time, with no loss in average precision. the new method yields significant improvements when compared against the best known static pruning method for search engine indices. in addition, since our technique is orthogonal to the underlying search algorithms, it can be adopted by virtually any search engine surrounding text:indices containing positional information are different in nature and not easily accessible to the pruning techniques described here (de moura et al. [***]<2> present a possible approach). our method is not suitable for phrase queries and other query forms that require the existence of positional information in the index; however, it can be used to perform bag-of-words search operations, in conjunction with scoring functions such as okapi bm25 [13]<2> and certain implementations of language-model-based retrieval functions [2]<2>. based on fagin's method, de moura et al. [***]<2> propose a locality-based pruning technique that, instead of only taking the highest-scoring postings into the pruned index, selects all postings corresponding to terms that appear in the same sentence as one of the postings selected by fagin's method. their experimental results indicate that this locality-enhanced version of the pruning algorithm outperforms the original version influence:1 type:2 pair index:45 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents.
this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:43 citee title:self-indexing inverted files for fast text retrieval citee abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness surrounding text:related work traditionally, pruning techniques in information retrieval systems have been dynamic, which means that they were applied at query time in order to reduce the computational cost required to find the set of top documents, given a search query. moffat and zobel [***]<2>, for example, increase their system's query processing performance by restricting the number of document score accumulators maintained in memory in a term-at-a-time query processing framework. persin et al influence:2 type:2 pair index:46 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:44 citee title:filtered document retrieval with frequency-sorted indexes citee abstract:ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement.
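The accumulator-limiting idea mentioned in the surrounding text above can be illustrated with a small, hypothetical Python sketch (the index layout, the weighting, and the processing order are assumptions, not the cited system's implementation): query terms are processed one at a time, and once the accumulator table is full, no new candidate documents are admitted.

    import math

    def ranked_query(index, doc_count, query_terms, max_accumulators=1000, top_k=10):
        """index: dict term -> list of (doc_id, tf). Returns the top_k (doc_id, score) pairs."""
        accumulators = {}                                   # doc_id -> partial score
        # process rarer (higher-weight) terms first
        for term in sorted(query_terms, key=lambda t: len(index.get(t, []))):
            postings = index.get(term, [])
            if not postings:
                continue
            idf = math.log(1 + doc_count / len(postings))
            for doc_id, tf in postings:
                if doc_id in accumulators:
                    accumulators[doc_id] += tf * idf
                elif len(accumulators) < max_accumulators:  # cap on new accumulators
                    accumulators[doc_id] = tf * idf
        return sorted(accumulators.items(), key=lambda x: x[1], reverse=True)[:top_k]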
we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed surrounding text:persin et al. [***]<2> describe a dynamic pruning technique that is based on within-document term frequencies. they also show how the inverted index can be reorganized (frequency-sorted index) in order to better support this kind of pruning influence:1 type:2 pair index:47 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:45 citee title:university of glasgow at trec2004: experiments in web, robust and terabyte tracks with terrier citee abstract:with our participation in trec2004, we test terrier, a modular and scalable information retrieval framework, in three tracks. for the mixed query task of the web track, we employ a decision mechanism for selecting appropriate retrieval approaches on a per-query basis. for the robust track, in order to cope with the poorly-performing queries, we use two pre-retrieval performance predictors and a weighting function recommender mechanism. we also test a new training approach for the automatic tuning of the term frequency normalisation parameters. in the terabyte track, we employ a distributed version of terrier and test the effectiveness of techniques, such as using the anchor text, pseudo query expansion and selecting different weighting models for each query. surrounding text:wqi is qi's idf weight: wqi = log(n/nqi), where n is the number of documents in the collection and nqi is the number of documents containing the term qi. for the free parameters, we chose k1 = 1.2 and b = 0.5, a configuration that was shown to be appropriate for the gov2 collection used in our experiments [***]<3>. within this general framework, document scores are computed in a document-at-a-time fashion influence:3 type:3 pair index:48 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents.
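To make the scoring formula referenced in the surrounding text concrete, here is a small, hedged Python sketch of an Okapi BM25-style term score using the idf weight wq = log(n/nq) and the stated parameters k1 = 1.2 and b = 0.5; the function signature and the example numbers are illustrative assumptions only.

    import math

    def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.5):
        """Contribution of a single query term to a document's BM25 score."""
        idf = math.log(n_docs / doc_freq)                    # wq = log(n / nq)
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1.0) / norm

    # example: a term occurring 3 times in a document of average length,
    # appearing in 1,000 out of 1,000,000 documents (numbers are made up)
    print(bm25_term_score(tf=3, doc_len=900, avg_doc_len=900, n_docs=1_000_000, doc_freq=1_000))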
this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:46 citee title:okapi at trec-7 citee abstract:the article presents text retrieval research results in the areas of automatic ad hoc retrieval, information filtering, vlc (very large collections) and interactive retrieval. for automatic ad hoc retrieval, three runs were submitted: medium (title and description), short (title only) and a run which was a combination of a long run (title, description and narrative) with the medium and short runs. the average precision of the last mentioned run was higher by several percent than any other submitted run, but another participant recently noticed an impossibly high score for one topic in the short run. this led to the discovery that due to a mistake in the indexing procedures part of the subject field of the la times documents had been indexed. use of this field was explicitly forbidden in the guidelines for the ad hoc track. the official runs were repeated against a corrected index, and the corrected results are presented, average precisions being reduced by about 2-4%. in the area of adaptive filtering, efforts focused on the twin problems of: (a) starting from scratch, with no assumed history of relevance judgments for each topic, and (b) having to define a threshold for retrieval. for vlc, four runs on the full database were submitted, together with one each on the 10% and 1% collections. in the area of interactive retrieval, two pairwise comparisons were made: okapi with relevance feedback against okapi without, and okapi without against zprise without. surrounding text:[9]<2> present a possible approach). our method is not suitable for phrase queries and other query forms that require the existence of positional information in the index; however, it can be used to perform bag-of-words search operations, in conjunction with scoring functions such as okapi bm25 [***]<2> and certain implementations of language-model-based retrieval functions [2]<2>. if a search query requires access to positional index information, a separate index, containing this information, can be used to process the query influence:3 type:3 pair index:49 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:47 citee title:compression of inverted indexes for fast query evaluation citee abstract:compression reduces both the size of indexes and the time needed to evaluate queries.
in this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. first, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. second, we explore the impact of choice of compression scheme on retrieval efficiency. in experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact golomb-rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the cpu cache is less for an appropriately compressed index than for an uncompressed index. moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. we conclude that fast byte-aligned codes should be used to store integers in inverted lists surrounding text:tf values greater than 24), while all other bits represent the document number the posting refers to. postings are grouped into blocks and compressed using a byte-aligned encoding method [***]<1>. each compressed block contains around 2^16 postings influence:3 type:2 pair index:50 citer id:35 citer title:a document-centric approach to static index pruning in text retrieval systems citer abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not. the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection citee id:48 citee title:query evaluation: strategies and optimization citee abstract:discusses two query evaluation strategies used in large text retrieval systems: (1) term-at-a-time; and (2) document-at-a-time. describes optimization techniques that can reduce query evaluation costs. presents simulation results that compare the performance of these optimization techniques when applied to natural language query evaluation surrounding text:document descriptors for the top documents encountered so far are kept in memory, again using a heap. standard optimizations, such as maxscore [***]<1>, are applied to reduce the computational cost of processing a query. additional data structures, not described here, allow us to efficiently look up the score impact of a posting, based on the document number and the term frequency, and to quickly produce official trec document ids (e. it shows that the number of postings inspected during query processing is decreased by up to 92% for  = 0.04.
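Byte-aligned ("variable-byte") integer coding, mentioned above as the way blocks of postings are compressed, can be sketched generically as follows; this is a plain textbook-style vbyte codec in Python, not the cited system's exact on-disk format.

    def vbyte_encode(numbers):
        """Encode non-negative integers, 7 payload bits per byte."""
        out = bytearray()
        for n in numbers:
            while n >= 128:
                out.append(n & 0x7F)        # low 7 bits, continuation byte
                n >>= 7
            out.append(n | 0x80)            # high bit set marks the final byte
        return bytes(out)

    def vbyte_decode(data):
        numbers, n, shift = [], 0, 0
        for byte in data:
            if byte & 0x80:                 # final byte of this integer
                numbers.append(n | ((byte & 0x7F) << shift))
                n, shift = 0, 0
            else:
                n |= byte << shift
                shift += 7
        return numbers

    gaps = [5, 1, 130, 2]                   # e.g. d-gaps between document numbers
    assert vbyte_decode(vbyte_encode(gaps)) == gaps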
the reason why the number of postings that are inspected during query processing is not proportional to the pruning level  is that the query processor employs the maxscore [***]<1> heuristic to ignore postings that cannot change the ranking of the top search results (here for the top 20). thus, many of the postings removed by the pruning process would not have been considered by the query processor anyway influence:3 type:2 pair index:51 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:480 citee title:vector-space ranking with effective early termination citee abstract:considerable research effort has been invested in improving the effectiveness of information retrieval systems. techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. but such enhancements can add to the cost of evaluating queries. in this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. we describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. that is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations. surrounding text:even with an efficient representation of postings [16]<2>, the list for a common term can require several megabytes for each gigabyte of indexed text. worse, heuristics such as frequency-ordering [13]<2> or impact-ordering [***]<2> are not of value, as the frequency of a word in a document does not determine its frequency of participation in a particular phrase. a crude solution is to use stopping, as is done by some widely-used web search engines (the google search engine, for example, neglects common words in queries), but this approach means that a small number of queries cannot be evaluated, while many more evaluate incorrectly [12]<2> influence:3 type:2 pair index:52 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases.
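The maxscore heuristic mentioned above can be illustrated, in simplified form, with the following Python sketch (a schematic, assumption-laden illustration, not the cited evaluator): given a per-term upper bound on score contributions and the score of the current k-th ranked document, the query terms are split into "essential" and "non-essential" groups; postings of non-essential terms only need to be probed for documents already found via the essential terms.

    def split_by_maxscore(term_max_scores, threshold):
        """term_max_scores: list of (term, max_score) sorted by max_score ascending.
        threshold: score of the current k-th ranked document."""
        non_essential, prefix_sum = [], 0.0
        for term, max_score in term_max_scores:
            if prefix_sum + max_score <= threshold:
                # even combined, these terms cannot push a new document into the top k
                non_essential.append(term)
                prefix_sum += max_score
            else:
                break
        essential = [t for t, _ in term_max_scores[len(non_essential):]]
        return essential, non_essential

    print(split_by_maxscore([("the", 0.4), ("static", 2.1), ("pruning", 3.5)], threshold=1.0))
    # -> (['static', 'pruning'], ['the'])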
in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:335 citee title:compaction techniques for nextword indexes citee abstract:most queries to text search engines are ranked or boolean. phrase querying is a powerful technique for refining searches, but is expensive to implement on conventional indexes. in other work, a nextword index has been proposed as a structure specifically designed for phrase queries. nextword indexes are, however, relatively large. in this paper we introduce new compaction techniques for nextword indexes. in contrast to most index compression schemes, these techniques are lossy, yet as we show allow full resolution of phrase queries without false match checking. we show experimentally that our novel techniques lead to significant savings in index size surrounding text:another solution is to index phrases directly, but the set of word pairs in a text collection is large and an index on such phrases difficult to manage. in recent work, nextword indexes were proposed as a way of supporting phrase queries and phrase browsing [***, 3, 15]<2>. in a nextword index, for each index term or firstword there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair influence:1 type:2 pair index:53 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements.
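The nextword index structure described in the surrounding text can be mimicked with a toy in-memory Python sketch (purely illustrative; real nextword indexes are compressed, disk-resident structures, and all names here are invented): for each firstword, a map from nextword to the (document, position) pairs at which the two words occur adjacently.

    from collections import defaultdict

    def build_nextword_index(docs):
        """docs: dict doc_id -> text. Returns firstword -> nextword -> [(doc_id, pos)]."""
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, text in docs.items():
            words = text.lower().split()
            for pos in range(len(words) - 1):
                index[words[pos]][words[pos + 1]].append((doc_id, pos))
        return index

    def word_pair_query(index, first, nxt):
        return index.get(first, {}).get(nxt, [])

    docs = {1: "the quick brown fox", 2: "quick brown dogs and a brown fox"}
    idx = build_nextword_index(docs)
    print(word_pair_query(idx, "brown", "fox"))   # [(1, 2), (2, 5)]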
general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:481 citee title:optimised phrase querying and browsing in text databases citee abstract:most search systems for querying large document collections (for example, web search engines) are based on well-understood information retrieval principles. these systems are both efficient and effective in finding answers to many user information needs, expressed through informal ranked or structured boolean queries. phrase querying and browsing are additional techniques that can augment or replace conventional querying tools. in this paper, we propose optimisations for phrase querying with a nextword index, an efficient structure for phrase-based searching. we show that careful consideration of which search terms are evaluated in a query plan and optimisation of the order of evaluation of the plan can reduce query evaluation costs by more than a factor of five. we conclude that, for phrase querying and browsing with nextword indexes, an ordered query plan should be used for all browsing and querying. moreover, we show that optimised phrase querying is practical on large text collections. surrounding text:another solution is to index phrases directly, but the set of word pairs in a text collection is large and an index on such phrases difficult to manage. in recent work, nextword indexes were proposed as a way of supporting phrase queries and phrase browsing [2, ***, 15]<2>. in a nextword index, for each index term or firstword there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair influence:1 type:2 pair index:54 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements.
general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:482 citee title:interactive internet search: keyword, directory and query reformulation mechanisms compared citee abstract:this article compares search effectiveness when using query-based internet search (via the google search engine), directory-based search (via yahoo) and phrase-based query reformulation assisted search (via the hyperindex browser) by means of a controlled, user-based experimental study. the focus was to evaluate aspects of the search process. cognitive load was measured using a secondary digit-monitoring task to quantify the effort of the user in various search states; independent surrounding text:it follows that phrase query evaluation can be extremely fast. nextword indexes also have the benefit of allowing phrase browsing or phrase querying [***, 15]<2>. given a sequence of words, the index can be used to identify which words follow the sequence, thus providing an alternative mechanism for searching text collections influence:3 type:2 pair index:55 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:483 citee title:relevance ranking for one- to three-term queries citee abstract:most search systems for querying large document collections (for example, web search engines) are based on well-understood information retrieval principles. these systems are both efficient and effective in finding answers to many user information needs, expressed through informal ranked or structured boolean queries. phrase querying and browsing are additional techniques that can augment or replace conventional querying tools. in this paper, we propose optimisations for phrase querying with a nextword index, an efficient structure for phrase-based searching. we show that careful consideration of which search terms are evaluated in a query plan and optimisation of the order of evaluation of the plan can reduce query evaluation costs by more than a factor of five. we conclude that, for phrase querying and browsing with nextword indexes, an ordered query plan should be used for all browsing and querying. moreover, we show that optimised phrase querying is practical on large text collections. surrounding text:the remaining entries are documents and word positions at which the phrase occurs. similar approaches have been described elsewhere [***, 10]<2>.
summarising, phrase queries are evaluated as follows. one possibility is to use a conventional inverted index in which the terms are word pairs. another way to support phrase based query modes is to index and store phrases directly [8]<2> or simply by using an inverted index and approximating phrases through a ranked query technique [***, 10]<2>. greater efficiency, with no additional in-memory space overheads, is possible with a special-purpose structure, the nextword index [15]<2>, where search structures are used to accelerate processing of word pairs influence:1 type:2 pair index:56 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:484 citee title:phrase recognition and expansion for short, precision-biased queries based on a query log citee abstract:in this paper we examine the question of query parsing for world wide web queries and present a novel method for phrase recognition and expansion. given a training corpus of approximately 16 million web queries and a handwritten context-free grammar, the em algorithm is used to estimate the parameters of a probabilistic context-free grammar (pcfg) with a system developed by carroll. we use the pcfg to compute the most probable parse for a user query, reflecting linguistic structure and word surrounding text:on the web, most queries consist of simple lists of words; however, a significant fraction of the queries include phrases, where the user has indicated that some of the query terms must be adjacent, typically by enclosing them in quotation marks. phrases have the advantage of being unambiguous concept markers and are therefore viewed as a valuable addition to ranked queries [6, ***]<1>. in this paper, we explore new techniques for efficient evaluation of phrase queries influence:2 type:3 pair index:57 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files.
we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:485 citee title:results and challenges in web search evaluation citee abstract:a frozen 18.5 million page snapshot of part of the web has been created to enable and encourage meaningful and reproducible evaluation of web search systems and techniques. this collection is being used in an evaluation framework within the text retrieval conference (trec) and will hopefully provide convincing answers to questions such as, "can link information result in better rankings?", "do longer queries result in better answers?", and, "do trec systems work well on web data?" the snapshot and associated evaluation methods are described and an invitation is extended to participate. preliminary results are presented for an effectiveness comparison of six trec systems working on the snapshot collection against five well-known web search systems working over the current web. these suggest that the standard of document rankings produced by public web search engines is by no means state-of-the-art. surrounding text:9 gb of html containing about 8.3 gb of text (drawn from the trec large web track [***]<3>). table 1 shows the size of the index with a range of levels of stopping influence:3 type:3 pair index:58 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:43 citee title:self-indexing inverted files for fast text retrieval citee abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents.
our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness surrounding text:a simple heuristic to address this problem is to directly merge the inverted lists rather than decode them in turn. on the one hand, merging has the disadvantage that techniques such as skipping [***]<2> cannot be as easily used to reduce processing costs (although as we discuss later skipping does not necessarily yield significant benefits). on the other hand, merging of at least some of the inverted lists is probably the only viable option when all the query terms are moderately common. offsets only have to be decoded when there is a document match, but they still have to be retrieved. other techniques do have the potential to reduce query evaluation time, in particular skipping [***]<2>, in which additional information is placed in inverted lists to reduce the decoding required in regions in the list that cannot contain postings that will match documents that have been identified as potential matches. on older machines, on which cpu cycles were relatively scarce, skipping could yield substantial gains influence:3 type:2 pair index:59 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:486 citee title:scalable browsing for large collections: a case study citee abstract:phrase browsing techniques use phrases extracted automatically from a large information collection as a basis for browsing and accessing it. this paper describes a case study that uses an automatically constructed phrase hierarchy to facilitate browsing of an ordinary large website. phrases are extracted from the full text using a novel combination of rudimentary syntactic processing and sequential grammar induction techniques. the interface is simple, robust and easy to use. to convey a surrounding text:worse, heuristics such as frequency-ordering [13]<2> or impact-ordering [1]<2> are not of value, as the frequency of a word in a document does not determine its frequency of participation in a particular phrase.
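The list-merging alternative sketched in the surrounding text can be illustrated with a small, hypothetical Python example for a two-word phrase (the posting layout is an assumption): both positional postings lists are walked in document order, and a document matches when the second word occurs at the position immediately after an occurrence of the first; a skipping structure would let the merge bypass long stretches of either list.

    def phrase_match(postings_a, postings_b):
        """postings_*: list of (doc_id, sorted positions), in increasing doc_id order."""
        matches, i, j = [], 0, 0
        while i < len(postings_a) and j < len(postings_b):
            doc_a, pos_a = postings_a[i]
            doc_b, pos_b = postings_b[j]
            if doc_a == doc_b:
                pos_set_b = set(pos_b)
                if any(p + 1 in pos_set_b for p in pos_a):   # adjacent occurrence?
                    matches.append(doc_a)
                i += 1
                j += 1
            elif doc_a < doc_b:
                i += 1
            else:
                j += 1
        return matches

    a = [(1, [2, 9]), (4, [5])]          # postings for the first word
    b = [(1, [3]), (3, [0]), (4, [9])]   # postings for the second word
    print(phrase_match(a, b))            # [1]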
a crude solution is to use stopping, as is done by some widely-used web search engines (the google search engine, for example, neglects common words in queries), but this approach means that a small number of queries cannot be evaluated, while many more evaluate incorrectly [***]<2>. another solution is to index phrases directly, but the set of word pairs in a text collection is large and an index on such phrases difficult to manage influence:2 type:2 pair index:60 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:44 citee title:filtered document retrieval with frequency-sorted indexes citee abstract:ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed surrounding text:even with an efficient representation of postings [16]<2>, the list for a common term can require several megabytes for each gigabyte of indexed text. worse, heuristics such as frequency-ordering [***]<2> or impact-ordering [1]<2> are not of value, as the frequency of a word in a document does not determine its frequency of participation in a particular phrase. a crude solution is to use stopping, as is done by some widely-used web search engines (the google search engine, for example, neglects common words in queries), but this approach means that a small number of queries cannot be evaluated, while many more evaluate incorrectly [12]<2>. for ranked query processing it is possible to predict which postings in each inverted list are most likely to be of value, and move these to the front of the inverted list.
techniques for such list modification include frequency-ordering [***]<2> and impact-ordering [2]. with these techniques, only the first of the inverted lists need be fetched during evaluation of most queries, greatly reducing costs influence:2 type:2 pair index:61 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:342 citee title:searching the web: the public and their queries citee abstract:in studying actual web searching by the public at large, we analyzed over one million web queries by users of the excite search engine. we found that most people use few search terms, few modified queries, view few web pages, and rarely use advanced search features. a small number of search terms are used with high frequency, and a great many terms are unique; the language of web queries is distinctive. queries about recreation and entertainment rank highest. findings are compared to data from two other large studies of web queries. this study provides an insight into the public practices and choices in web searching. surrounding text:the nextword index takes the middle ground by indexing pairs of words and, therefore, is particularly good at resolving phrase queries containing two or more words. as noted above and observed elsewhere, the commonest number of words in a phrase is two [***]<3>. a nextword index is a three-level structure influence:2 type:3 pair index:62 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements.
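As a small illustration of the list modification just mentioned, the following hypothetical Python sketch sorts a postings list by decreasing within-document frequency so that a ranked-query evaluator can fetch only a prefix of each list (as the surrounding text notes, this ordering does not help phrase queries); the names and layouts are assumptions for the example.

    def frequency_sort(postings):
        """postings: list of (doc_id, tf) -> same postings, highest tf first."""
        return sorted(postings, key=lambda p: p[1], reverse=True)

    def fetch_prefix(postings, budget):
        """Only the first `budget` postings of a frequency-sorted list are decoded."""
        return frequency_sort(postings)[:budget]

    print(fetch_prefix([(7, 1), (2, 9), (5, 4)], budget=2))   # [(2, 9), (5, 4)]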
general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:487 citee title:what's next? index structures for efficient phrase querying citee abstract:text retrieval systems are used to fetch documents from large text collections, using queries consisting of words and word sequences surrounding text:another solution is to index phrases directly, but the set of word pairs in a text collection is large and an index on such phrases difficult to manage. in recent work, nextword indexes were proposed as a way of supporting phrase queries and phrase browsing [2, 3, ***]<2>. in a nextword index, for each index term or firstword there is a list of the words or nextwords that follow that term, together with the documents and word positions at which the firstword and nextword occur as a pair. another way to support phrase based query modes is to index and store phrases directly [8]<2> or simply by using an inverted index and approximating phrases through a ranked query technique [5, 10]<2>. greater efficiency, with no additional in-memory space overheads, is possible with a special-purpose structure, the nextword index [***]<2>, where search structures are used to accelerate processing of word pairs. the nextword index takes the middle ground by indexing pairs of words and, therefore, is particularly good at resolving phrase queries containing two or more words. it follows that phrase query evaluation can be extremely fast. nextword indexes also have the benefit of allowing phrase browsing or phrase querying [4, ***]<2>. given a sequence of words, the index can be used to identify which words follow the sequence, thus providing an alternative mechanism for searching text collections influence:1 type:2 pair index:63 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:343 citee title:managing gigabytes: compressing and indexing documents and images citee abstract:in this fully updated second edition of the highly acclaimed managing gigabytes, authors witten, moffat, and bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. 
it covers the latest developments in compression and indexing and their application on the web and in digital libraries. it also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the web. surrounding text:this does not mean, however, that the process is fast. even with an efficient representation of postings [***]<2>, the list for a common term can require several megabytes for each gigabyte of indexed text. worse, heuristics such as frequency-ordering [13]<2> or impact-ordering [1]<2> are not of value, as the frequency of a word in a document does not determine its frequency of participation in a particular phrase influence:3 type:3 pair index:64 citer id:38 citer title:efficient phrase querying with an auxiliary index citer abstract:search engines need to evaluate queries extremely fast, a challenging task given the vast quantities of data being indexed. a significant proportion of the queries posed to search engines involve phrases. in this paper we consider how phrase queries can be efficiently supported with low disk overheads. previous research has shown that phrase queries can be rapidly evaluated using nextword indexes, but these indexes are twice as large as conventional inverted files. we propose a combination of nextword indexes with inverted files as a solution to this problem. our experiments show that combined use of an auxiliary nextword index and a conventional inverted file allow evaluation of phrase queries in half the time required to evaluate such queries with an inverted file alone, and the space overhead is only 10% of the size of the inverted file. further time savings are available with only slight increases in disk requirements. general terms indexing, query evaluation keywords inverted indexes, nextword indexes, evaluation efficiency, index size, stopping, phrase query citee id:488 citee title:exploring the similarity space citee abstract:ranked queries are used to locate relevant documents in text databases. in a ranked query a list of terms is specified, then the documents that most closely match the query are returned---in decreasing order of similarity---as answers. crucial to the efficacy of ranked querying is the use of a similarity heuristic, a mechanism that assigns a numeric score indicating how closely a document and the query match. in this note we explore and categorise a range of similarity heuristics described in surrounding text:the lower level is a set of postings lists, one per index term. following the notation of zobel and moffat [***]<1>, each posting is a triple of the form ⟨d, f_{d,t}, ...⟩ influence:3 type:3 pair index:65 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term.
we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:493 citee title:optimization of inverted vector searches citee abstract:a simple algorithm is presented for increasing the efficiency of information retrieval searches which are implemented using inverted files. this optimization algorithm employs knowledge about the methods used for weighting document and query terms in order to examine as few inverted lists as possible. an extension to the basic algorithm allows greatly increased performance optimization at a modest cost in retrieval effectiveness. experimental runs are made examining several different term weighting models and showing the optimization possible with each. surrounding text:in addition to reducing disk space, our pruning methods decrease both the time required to search the resulting index, and the amount of memory consumed for searching, in a manner similar to that of dynamic pruning techniques that have been described in the literature. these techniques dynamically decide, during query evaluation, whether certain terms or document postings are worth adding to the accumulated document scores, and whether the ranking process should continue or stop [***, 6, 9, 14]<1>. the typical approach of these algorithms is to order query terms by decreasing weight, and to process terms from most to least significant until some stopping condition is met influence:2 type:2 pair index:66 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:768 citee title:new retrieval approaches using smart: trec 4 citee abstract:the smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. we continue our work in trec 4, performing runs in the routing, ad-hoc, confused text, interactive, and foreign language environments. introduction: for over 30 years, the smart project at cornell university has been interested in the analysis, search, and retrieval of heterogeneous text databases, where the vocabulary is allowed to vary widely, and surrounding text:the score of term t for document d depends on the term frequency tf of t in d, the length |d| of document d, and the inverse document frequency (idf), derived from the number of documents containing t in the collection.
for example, our system uses the following scoring model based on the model used by the smart system [***]<1>: a(t, d) = (log(1 + tf) / log(1 + avgtf)) * (log(N / N_t) / |d|) (1), where avgtf is the average term frequency in d, N is the number of documents in the collection, N_t is the number of documents containing t, and |d| = [0.8 ...]. a query q is a set of terms t_1, . . . , t_r, where each term t_i is associated with a positive weight α_i. for example, in the experiments done for this work, query terms are extracted from trec topics, and the term weights are determined by the following equation (also based on the scoring model used by the smart system [***]<1>): α_i = log(1 + tf_i) / log(1 + avgtf) (2), where tf_i is the term frequency of term t_i in the topic and avgtf is the average term frequency in the topic. the score of document d for query q is s_q(d) = Σ_{i=1}^{r} α_i * a(t_i, d) influence:3 type:3 pair index:67 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:655 citee title:indexing by latent semantic analysis citee abstract:a new method for automatic indexing and retrieval is described. the approach is to take advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") in order to improve the detection of relevant documents on the basis of terms found in queries. the particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. documents are represented by ca. 100 item vectors of factor weights. queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. initial tests find this completely automatic method for retrieval to be promising. surrounding text:thereby, a smaller index size can be attained than is possible by either one of the methods separately. examples of the lossy approach include stopword omission and latent semantic indexing (lsi) [***]<3>. while the primary goal of both stopword omission and lsi is to reduce the noise in the system by pruning terms that deteriorate precision, their practical effect of reducing index size is very relevant to this work influence:2 type:2 pair index:68 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems.
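a minimal sketch of the reconstructed smart-style scoring in equations (1) and (2) above; the statistics and helper names are toy values chosen for illustration and are not juru's actual implementation.

```python
import math

def term_score(tf, avgtf_d, N, Nt, doc_norm):
    # a(t, d) from equation (1): tf normalized by the average term frequency
    # in d, times an idf factor, divided by the document-length norm |d|.
    return (math.log(1 + tf) / math.log(1 + avgtf_d)) * math.log(N / Nt) / doc_norm

def query_term_weight(tf_topic, avgtf_topic):
    # alpha_i from equation (2), computed from topic statistics.
    return math.log(1 + tf_topic) / math.log(1 + avgtf_topic)

def query_score(query_terms, doc, N):
    # S_q(d) = sum over query terms of alpha_i * a(t_i, d).
    score = 0.0
    for term, tf_topic, avgtf_topic in query_terms:
        if term in doc["tf"]:
            score += query_term_weight(tf_topic, avgtf_topic) * term_score(
                doc["tf"][term], doc["avgtf"], N, doc["Nt"][term], doc["norm"])
    return score

# toy document and query; all numbers are invented for illustration
doc = {"tf": {"index": 3, "pruning": 5}, "Nt": {"index": 50, "pruning": 10},
       "avgtf": 2.0, "norm": 1.0}
query = [("index", 1, 1.5), ("pruning", 2, 1.5)]   # (term, tf in topic, avgtf in topic)
print(round(query_score(query, doc, N=1000), 3))
```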
we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:498 citee title:retrieving records from a gigabyte of text on a minicomputer using statistical ranking citee abstract:statistically based ranked retrieval of records using keywords provides many advantages over traditional boolean retrieval methods, especially for end users. this approach to retrieval, however, has not seen widespread use in large operational retrieval systems. to show the feasibility of this retrieval methodology, research was done to produce very fast search techniques using these ranking algorithms, and then to test the results against large databases with many end users. the results show not only response times on the order of 1 and l/2 seconds for 806 megabytes of text, but also very favorable user reaction. novice users were able to consistently obtain good search results after 5 minutes of training. additional work was done to devise new indexing techniques to create inverted files for large databases using a minicomputer. these techniques use no sorting, require a working space of only about 20% of the size of the input text, and produce indices that are about 14% of the input text size. surrounding text:in addition to reducing disk space, our pruning methods decrease both the time required to search the resulting index, and the amount of memory consumed for searching, in a manner similar to that of dynamic pruning techniques that have been described in the literature. these techniques dynamically decide, during query evaluation, whether certain terms or document postings are worth adding to the accumulated document scores, and whether the ranking process should continue or stop [1, ***, 9, 14]<1>. the typical approach of these algorithms is to order query terms by decreasing weight, and to process terms from most to least significant until some stopping condition is met influence:2 type:2 pair index:69 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. 
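the difference between uniform and term-based pruning described above can be sketched as follows; the in-memory index layout, the threshold values, and the rule of pruning relative to each term's top-scoring entry are simplifications for illustration, not the exact algorithm of the paper.

```python
# Sketch of the two static pruning policies, applied to an in-memory index of
# the form term -> {doc_id: score contribution}.

def uniform_prune(index, tau):
    # uniform pruning: one global cutoff; drop every posting whose score
    # contribution is bounded above by tau.
    return {t: {d: s for d, s in plist.items() if s > tau}
            for t, plist in index.items()}

def term_based_prune(index, epsilon):
    # term-based pruning: a per-term cutoff, here a fraction epsilon of the
    # term's highest-scoring posting, so the threshold varies from term to term.
    pruned = {}
    for t, plist in index.items():
        if not plist:
            pruned[t] = {}
            continue
        tau_t = epsilon * max(plist.values())
        pruned[t] = {d: s for d, s in plist.items() if s > tau_t}
    return pruned

index = {"pruning": {1: 2.4, 2: 0.3, 3: 1.9}, "the": {1: 0.2, 2: 0.1, 3: 0.2}}
print(uniform_prune(index, 0.25))
print(term_based_prune(index, 0.5))
```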
we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:500 citee title:full text indexing based on lexical relations: an application: software libraries citee abstract:in contrast to other kinds of libraries, software libraries need to be conceptually organized. when looking for a component, the main concern of users is the functionality of the desired component; implementation details are secondary. software reuse would be enhanced with conceptually organized large libraries of software components. in this paper, we present guru, a tool that allows automatical building of such large software libraries from documented software components. we focus here on guru's indexing component which extracts conceptual attributes from natural language documentation. this indexing method is based on words' co-occurrences. it first uses extract, a co-occurrence knowledge compiler for extracting potential attributes from textual documents. conceptually relevant collocations are then selected according to their resolving power, which scales down the noise due to context words. this fully automated indexing tool thus goes further than keyword-based tools in the understanding of a document without the brittleness of knowledge based tools. the indexing component of guru is fully implemented, and some results are given in the paper surrounding text:we report results from some experiments we conducted to evaluate how the pruning algorithms behave, both for short queries (where our theory applies) and for long queries. for our experiments we use juru, a java version of the guru [***]<3> search engine, developed at the ibm research lab in haifa. juru does not apply any compression methods on its inverted files, except stopword omission, and uses the tf �� idf formula described in equation 1 influence:3 type:3 pair index:70 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. 
we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:501 citee title:fast ranking in limited space citee abstract:ranking techniques have long been suggested as alternatives to more conventional boolean methods for searching document collections. the cost of computing a ranking is, however, greater than the cost of performing a boolean search, in terms of both memory space and processing time. here we consider the resources required by the cosine method of ranking, and show that with a careful application of indexing and selection techniques both the space and time required by ranking can be substantially surrounding text:in addition to reducing disk space, our pruning methods decrease both the time required to search the resulting index, and the amount of memory consumed for searching, in a manner similar to that of dynamic pruning techniques that have been described in the literature. these techniques dynamically decide, during query evaluation, whether certain terms or document postings are worth adding to the accumulated document scores, and whether the ranking process should continue or stop [1, 6, ***, 14]<1>. the typical approach of these algorithms is to order query terms by decreasing weight, and to process terms from most to least significant until some stopping condition is met influence:2 type:2 pair index:71 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:44 citee title:filtered document retrieval with frequency-sorted indexes citee abstract:ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement.
we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed surrounding text:among the stopping conditions are a threshold on the number of document accumulators used, and a threshold on term frequencies. while these schemes select terms, and hence whole posting lists, for processing or rejection, persin et al. [10, ***]<2> propose a method that prunes entries within these lists. in this approach, the processing of a particular query term is terminated when a stopping condition is met influence:3 type:2 pair index:72 citer id:40 citer title:static index pruning for information retrieval systems citer abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index citee id:813 citee title:overview of the seventh text retrieval conference (trec-7) citee abstract:this paper serves as an introduction to the research described in detail in the remainder of the volume. it concentrates mainly on the main task, ad hoc retrieval, which is defined in the next section. details regarding the test collections and evaluation methodology used in trec follow in sections 3 and 4, while section 5 provides an overview of the ad hoc retrieval results. in addition to the main ad hoc task, trec-7 contained seven "tracks," tasks that focus research on particular subproblems surrounding text:juru does not apply any compression methods on its inverted files, except stopword omission, and uses the tf × idf formula described in equation 1. we use the trec collection [***]<3> for our experiments. we used the los angeles times (lat) data given in trec, which contains about 132,000 documents (476mb), and evaluated our methods on the ad hoc tasks for trec-7, with topics 401-450. 1 performance measurement in order to measure the accuracy of the pruning algorithms, we compare the results obtained from the pruned index against the results obtained from the original index. we use the standard average precision, and precision at k (p@k) measures [***]<3>. in addition, we want a method to compare the top results in one list (obtained from the pruned index) against the top results in another list (obtained from the original index) influence:3 type:3 pair index:73 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list.
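a minimal sketch of within-list pruning over a frequency-sorted inverted list, in the spirit of the approach attributed to persin et al. above; the stopping rule used here is a placeholder, not the published condition.

```python
# Postings are stored in decreasing within-document frequency, so once a
# posting's contribution falls below the threshold, every remaining posting
# can be skipped without being decoded.

def process_term(postings_by_freq, idf, accumulators, min_contribution):
    # postings_by_freq: list of (doc_id, tf) sorted by decreasing tf.
    for doc_id, tf in postings_by_freq:
        contribution = tf * idf
        if contribution < min_contribution:
            break  # all later postings have equal or smaller tf
        accumulators[doc_id] = accumulators.get(doc_id, 0.0) + contribution

accumulators = {}
postings = [(7, 12), (3, 9), (11, 2), (5, 1)]   # already frequency-sorted
process_term(postings, idf=0.8, accumulators=accumulators, min_contribution=2.0)
print(accumulators)   # documents 11 and 5 are never scored
```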
this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness citee id:394 citee title:data compression in fulltext retrieval systems citee abstract:describes compression methods for components of full-text systems such as text databases on cd-rom. topics discussed include storage media; structures for full-text retrieval, including indexes, inverted files, and bitmaps; compression tools; memory requirements during retrieval; and ranking and information retrieval. surrounding text:without compression, an inverted file can easily be as large or larger than the text it indexes. compression results in a net space reduction of as much as 80% of the inverted file size [***]<1>, but even with fast decompression (decoding at approximately 400,000 numbers per second on a sun sparc 10) it involves a substantial overhead on processing time. here we consider how to reduce these space and time costs, with particular emphasis on environments in which index compression has been used. 1 compressing inverted files techniques for compressing inverted lists, or equivalently bitmaps, have been described by many authors, including bell et al. [***]<1>, bookstein, klein, and raita [2]<1>, choueka, fraenkel, and klein [4]<1>, fraenkel and klein [12]<1>, klein, bookstein, and deerwester [21]<1>, and linoff and stanfill [22]<1>. faloutsos described the application of similar techniques to the compression of sparse signatures [8, 9]<2> influence:2 type:2 pair index:74 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness citee id:741 citee title:model based concordance compression citee abstract:the authors discuss concordance compression using the framework now customary in compression theory. they begin by creating a mathematical model of concordance generation, and then use optimal compression engines, such as huffman or arithmetic coding, to do the actual compression. it should be noted that in the context of a static information retrieval system, compression and decompression are not symmetrical tasks. compression is done only once, while building the system, whereas decompression is needed during the processing of every query and directly affects the response time.
one may thus use extensive and costly preprocessing for compression, provided reasonably fast decompression methods are possible. moreover, compression is applied to the full files (text, concordance, etc.), but decompression is needed only for (possibly many) short pieces, which may be accessed at random by means of pointers to their exact locations. therefore the use of adaptive methods based on tables that systematically change from the beginning to the end of the file is ruled out. however, their concern is less the speed of encoding or decoding than relating concordance compression conceptually to the modern approach of data compression, and testing the effectiveness of their models surrounding text:1 compressing inverted files techniques for compressing inverted lists, or equivalently bitmaps, have been described by many authors, including bell et al . [1]<1>, bookstein, klein, and raita [***]<1>, choueka, fraenkel, and klein [4]<1>, fraenkel and klein [12]<1>, klein, bookstein, and deerwester [21]<1>, and lino and stanfill [22]<1>. faloutsos described the application of similar techniques to the compression of sparse signatures [8, 9]<2> influence:2 type:2 pair index:75 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:493 citee title:optimization of inverted vector searches citee abstract:a simple algorithm is presented for increasing the efficiency of information retrieval searches which are implemented using inverted files. this optimization algorithm employs knowledge about the methods used for weighting document and query terms in order to examine as few inverted lists as possible. an extension to the basic algorithm allows greatly increased performance optimization at a modest cost in retrieval effectiveness. experimental runs are made examining several different term weighting models and showing the optimization possible with each. surrounding text:ranking techniques can also be supported by inverted files. when the documents are stored in a database that is indexed by an inverted file several additional structures must be used if evaluation is to be fast [***, 18, 31]<1>. these include a weight for each word in the vocabulary. a table of mathematical symbols is provided at the end of the paper. 2 document databases in an inverted file document database, each distinct word in the database is held in a vocabulary [***, 11, 18, 24, 25, 31]<1>. 
the vocabulary entry for each word contains an address pointer to an inverted list (also known as a postings list ), a contiguous list of the documents containing the word influence:2 type:3 pair index:76 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:574 citee title:fundamentals of database systems citee abstract:clear explanations of theory and design, broad coverage of models and real systems, and an up-to-date introduction to modern database technologies result in a leading introduction to database systems. with fresh new problems and a new lab manual, students get more opportunities to practice the fundamentals of design and implementation. more real-world examples serve as engaging, practical illustrations of database concepts. the fifth edition maintains its coverage of the most popular database topics, including sql, security, data mining, and contains a new chapter on web script programming for databases. surrounding text:ranking, on the other hand, is a process of matching an informal query to the documents and allocating scores to documents according to their degree of similarity to the query [29, 30]<1>. a standard mechanism for supporting boolean queries is an inverted file [***, 11]<1>. an inverted file contains, for each term that appears anywhere in the database, a list of the numbers of the documents containing that term influence:2 type:3 pair index:77 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:196 citee title:access methods for text citee abstract:this paper compares text retrieval methods intended for office systems. the operational requirements of the office environment are discussed, and retrieval methods from database systems and from information retrieval systems are examined. 
we classify these methods and examine the most interesting representatives of each class. attempts to speed up retrieval with special purpose hardware are also presented, and issues such as approximate string matching and compression are discussed. a qualitative comparison of the examined methods is presented. the signature file method is discussed in more detail surrounding text:[1]<1>, bookstein, klein, and raita [2]<1>, choueka, fraenkel, and klein [4]<1>, fraenkel and klein [12]<1>, klein, bookstein, and deerwester [21]<1>, and linoff and stanfill [22]<1>. faloutsos described the application of similar techniques to the compression of sparse signatures [***, 9]<2>. our presentation is based on that of moffat and zobel [26]<1>, who compare a variety of index compression methods influence:2 type:2 pair index:78 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness citee id:839 citee title:signature files: an access method for documents and its analytical performance evaluation citee abstract:the signature-file access method for text retrieval is studied. according to this method, documents are stored sequentially in the "text file." abstractions ("signatures") of the documents are stored in the "signature file." the latter serves as a filter on retrieval: it helps in discarding a large number of nonqualifying documents. in this paper two methods for creating signatures are studied analytically, one based on word signatures and the other on superimposed coding. closed-form formulas are derived for the false-drop probability of the two methods, factors that affect it are studied, and performance comparisons of the two methods based on these formulas are provided. surrounding text:and an uncompressed index would take a similar length of time. for comparison, it is also interesting to estimate the performance of another indexing method advocated for conjunctive boolean queries, the bitsliced signature file [***, 28]<2>. in this case at least 10 bitslices of 212 kb each (one slice per query term, each of one bit per document in the collection) must be fetched and conjoined influence:2 type:2 pair index:79 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list.
this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness citee id:801 citee title:optimal source codes for geometrically distributed integer alphabets citee abstract:let $p(i) = (1 - \theta)\theta^i$ be a probability assignment on the set of nonnegative integers, where $\theta$ is an arbitrary real number, $0 < \theta < 1$. we show that an optimal binary source code for this probability assignment is constructed as follows. let $l$ be the integer satisfying $\theta^l + \theta^{l+1} \le 1 < \theta^l + \theta^{l-1}$ and represent each nonnegative integer $i$ as $i = lj + r$ where $j = \lfloor i/l \rfloor$, the integer part of $i/l$, and $r = i \bmod l$. encode $j$ by a unary code (i.e., $j$ zeros followed by a single one), and encode $r$ by a huffman code, using codewords of length $\lfloor \log_2 l \rfloor$ for $r < 2^{\lfloor \log_2 l \rfloor + 1} - l$, and length $\lfloor \log_2 l \rfloor + 1$ otherwise. an optimal code for the nonnegative integers is the concatenation of those two codes surrounding text:similarly, run-lengths of 10 through to 36 would be assigned codes with a "10" prefix and either a 4-bit or a 5-bit suffix: "0000" for 10 through to "0100" for 14, then "01010" for 15 through to "11111" for 36.
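the construction described in the abstract above (a unary quotient plus a truncated binary remainder, i.e. the golomb code with parameter l) can be sketched as follows; the function name and the small demonstration values are ours.

```python
# Split a nonnegative integer i into i = l*j + r, emit j in unary (j zeros
# then a one), and emit r with a truncated binary code of floor(log2 l) or
# floor(log2 l) + 1 bits, as in the abstract above.
import math

def golomb_encode(i, l):
    j, r = divmod(i, l)
    bits = "0" * j + "1"                    # unary part for the quotient
    k = int(math.floor(math.log2(l))) if l > 1 else 0
    cutoff = 2 ** (k + 1) - l               # remainders below this get k bits
    if r < cutoff:
        bits += format(r, "b").zfill(k) if k else ""
    else:
        bits += format(r + cutoff, "b").zfill(k + 1)
    return bits

# Example with l = 3: remainders 0, 1, 2 are coded as "0", "10", "11".
for i in range(6):
    print(i, golomb_encode(i, 3))
```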
golomb [***]<2> and gallager and van voorhis [14]<2> also considered prefix-free encodings of the integers. they showed that coding relative to the vector vg = (b influence:3 type:3 pair index:81 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:840 citee title:storing text retrieval systems on cd-rom: compression and encryption considerations citee abstract:the emergence of the cd-rom as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. as an example, the problem of storing the tr��sor de la langue fran&ccidel;aise on a cd-rom is examined in this paper. the text alone of this database is 700 megabytes long, more than a cd-rom can hold. in addition, the dictionary and concordance needed to access these data must be stored. a further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. pertinent approaches to compression of the various files are reviewed, and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, huffman encoding produces a bit-string indistinguishable from a representation of coin flips. surrounding text:1 compressing inverted files techniques for compressing inverted lists, or equivalently bitmaps, have been described by many authors, including bell et al . [1]<1>, bookstein, klein, and raita [2]<1>, choueka, fraenkel, and klein [4]<1>, fraenkel and klein [12]<1>, klein, bookstein, and deerwester [***]<1>, and lino and stanfill [22]<1>. faloutsos described the application of similar techniques to the compression of sparse signatures [8, 9]<2> influence:2 type:2 pair index:82 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. 
similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:341 citee title:compression of indexes with full positional information in very large text databases citee abstract:this paper describes a combination of compression methods which may be used to reduce the size of inverted indexes for very large text databases. these methods are prefix omission, run-length encoding, and a novel family of numeric representations called n-s coding. using these compression methods on two different text sources (the king james version of the bible and a sample of wall street journal stories), the compressed index occupies less than 40% of the size of the original text, even when both stopwords and numbers are included in the index. the decreased time required for i/o can almost fully compensate for the time needed to uncompress the postings. this research is part of an effort to handle very large text databases on the cm-5, a massively parallel mimd supercomputer. surrounding text:1 compressing inverted files techniques for compressing inverted lists, or equivalently bitmaps, have been described by many authors, including bell et al . [1]<1>, bookstein, klein, and raita [2]<1>, choueka, fraenkel, and klein [4]<1>, fraenkel and klein [12]<1>, klein, bookstein, and deerwester [21]<1>, and lino and stanfill [***]<1>. faloutsos described the application of similar techniques to the compression of sparse signatures [8, 9]<2> influence:2 type:2 pair index:83 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:414 citee title:development of a stemming algorithm citee abstract:a stemming algorithm, a procedure to reduce all words with the same stem to a common form, is useful in many areas of computational linguistics and information retrieval work. while the form of the algorithm varies with its applications, certain linguistic problems are common to any stemming procedure. as a basis for evaluation of previous attempts to deal with these problems, the paper discusses the theoretical and practical attributes of stemming algorithms. a new version of a context-sensitive, longest-match stemming algorithm for english, developed for use in a library information transfer system but of general applications, is then proposed. a major linguistic problem in stemming, variation in spelling of stems, is discussed in some detail and several feasible programmed solutions are outlined, along with sample results of one of these methods surrounding text:4 words per record. 
and 538,244 distinct words, after folding all letters to lowercase and removal of variant endings using lovin's stemming algorithm [***]<3>. the index comprises 195,935,531 stored hd influence:3 type:3 pair index:84 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:230 citee title:an inverted index implementation citee abstract:a working implementation of an inverted index suitable for real time applications is described. based upon a hash addressed random access organization with variable length records, the index structure and processing algorithms were developed with the aid of a simulation model. some typical simulation results are presented and critical parameters identified. one solution to the inverted key identification problem is proposed, although a severe time penalty accrues from designing software which is not directly dependent upon the main file record format. surrounding text:a table of mathematical symbols is provided at the end of the paper. 2 document databases in an inverted file document database, each distinct word in the database is held in a vocabulary [3, 11, 18, 24, ***, 31]<1>. the vocabulary entry for each word contains an address pointer to an inverted list (also known as a postings list ), a contiguous list of the documents containing the word influence:3 type:3 pair index:85 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. 
similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness citee id:816 citee title:parameterised compression for sparse bitmaps citee abstract:full-text retrieval systems typically use either a bitmap or an inverted file to identify which documents contain which words, so that the documents containing any combination of words can be quickly located. bitmaps of word occurrences are large, but are usually sparse, and thus are amenable to a variety of compression techniques. here we consider techniques in which the encoding of each bitvector within the bitmap is parameterised, so that a different code can be used for each bitvector. surrounding text:faloutsos described the application of similar techniques to the compression of sparse signatures [8, 9]<2>. our presentation is based on that of moffat and zobel [***]<1>, who compare a variety of index compression methods. to represent each inverted list, the series of differences between successive numbers is stored as a list of run-lengths or d-gaps. the codes are both prefix-free (no codeword is a prefix of another) and so unambiguous decoding without backtracking is possible.
table 1: examples of codes, by coding method
x | elias gamma | elias delta | golomb, b = 3
1 | 0        | 0         | 0,0
2 | 10,0     | 100,0     | 0,10
3 | 10,1     | 100,1     | 0,11
4 | 110,00   | 101,00    | 10,0
5 | 110,01   | 101,01    | 10,10
6 | 110,10   | 101,10    | 10,11
7 | 110,11   | 101,11    | 110,0
8 | 1110,000 | 11000,000 | 110,10
the gamma and delta codes are instances of a more general coding paradigm as follows [12, ***]<1>. let v be a (possibly infinite) vector of positive integers v_i, where Σ v_i ≥ N, the number of documents in the collection influence:2 type:1 pair index:86 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness citee id:565 citee title:memory efficient ranking citee abstract:fast and effective ranking of a collection of documents with respect to a query requires several structures, including a vocabulary, inverted file entries, arrays of term weights and document lengths, an array of partial similarity accumulators, and address tables for inverted file entries and documents. of all of these structures, the array of document lengths and the array of accumulators are the components accessed most frequently in a ranked query, and it is crucial to acceptable surrounding text:such scaling is, however, incompatible with index compression: it does reduce memory requirements, but this reduction comes at the cost of a substantial growth in the size of the inverted file.
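the gamma and delta columns of the reconstructed table 1 can be reproduced with the sketch below (the comma separates the prefix and suffix parts, as in the table); the golomb column follows from the coder sketched after the gallager and van voorhis abstract, with the quotient written as ones terminated by a zero.

```python
# Elias gamma and delta codes over the positive integers: the part before the
# comma selects a power-of-two bucket, and the part after it is a binary
# offset within that bucket.
import math

def elias_gamma(x):
    n = int(math.floor(math.log2(x)))
    prefix = "1" * n + "0"                       # bucket, in unary
    suffix = format(x - 2 ** n, "b").zfill(n) if n else ""
    return prefix, suffix

def elias_delta(x):
    n = int(math.floor(math.log2(x)))
    p, s = elias_gamma(n + 1)                    # bucket number is gamma-coded
    suffix = format(x - 2 ** n, "b").zfill(n) if n else ""
    return p + s, suffix

for x in range(1, 9):
    g = ",".join(part for part in elias_gamma(x) if part)
    d = ",".join(part for part in elias_delta(x) if part)
    print(x, g, d)    # matches the gamma and delta columns of table 1
```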
a better method is to use low-precision approximations to the document weights, which can reduce each document length to around six bits without significantly affecting retrieval effectiveness or retrieval time [***]<2>. furthermore, in a multi-user environment the cost of storing the weights can be amortised over all active processes, since the weights are static and can be stored in shared memory influence:2 type:3 pair index:87 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness citee id:758 citee title:multi-key access methods based on superimposed coding techniques citee abstract:both single-level and two-level indexed descriptor schemes for multikey retrieval are presented and compared. the descriptors are formed using superimposed coding techniques and stored using a bit-inversion technique. a fast-batch insertion algorithm for which the cost of forming the bit-inverted file is less than one disk access per record is presented. for large data files, it is shown that the two-level implementation is generally more efficient for queries with a small number of matching records. for queries that specify two or more values, there is a potential problem with the two-level implementation in that costs may accrue when blocks of records match the query but individual records within these blocks do not. one approach to overcoming this problem is to set bits in the descriptors based on pairs of indexed terms. this approach is presented and analyzed. surrounding text:and an uncompressed index would take a similar length of time. for comparison, it is also interesting to estimate the performance of another indexing method advocated for conjunctive boolean queries, the bitsliced signature file [10, ***]<2>. in this case at least 10 bitslices of 212 kb each (one slice per query term, each of one bit per document in the collection) must be fetched and conjoined. moreover, a signature file index is typically several times larger than a compressed inverted file, even after the insertion of skips. multi-level signature file organisations reduce processing time by forming "super" signatures for blocks of records, so that record signatures for a block are investigated only if all query terms appear somewhere in the block [20, ***]<2>. while they reduce the amount of data transferred from disk during query evaluation, these methods do not reduce the size of the index influence:2 type:2 pair index:88 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term.
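one way to picture reducing each document length to around six bits is a geometrically spaced quantizer like the sketch below; the spacing choice and parameter values are assumptions for illustration, and the cited work may use a different approximation.

```python
# Map each document length onto one of 2^6 geometrically spaced buckets and
# keep only the 6-bit bucket index; decode recovers an approximate length.
import math

BITS = 6

def make_quantizer(min_len, max_len, bits=BITS):
    levels = 2 ** bits
    ratio = (max_len / min_len) ** (1.0 / (levels - 1))
    def encode(length):
        code = round(math.log(length / min_len, ratio))
        return max(0, min(levels - 1, code))
    def decode(code):
        return min_len * ratio ** code
    return encode, decode

encode, decode = make_quantizer(min_len=10.0, max_len=10000.0)
for length in (10.0, 250.0, 4000.0, 10000.0):
    code = encode(length)
    print(length, code, round(decode(code), 1))
```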
here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:566 citee title:introduction to modern information retrieval citee abstract:new technology now allows the design of sophisticated information retrieval system that can not only analyze, process and store, but can also retrieve specific resources matching a particular user��s needs. this clear and practical text relates the theory, techniques and tools critical to making information retrieval work. a completely revised second edition incorporates the latest developments in this rapidly expanding field, including multimedia information retrieval, user interfaces and digital libraries. chowdhury��s coverage is comprehensive, including classification, cataloging, subject indexing, ing, vocabulary control; cd-rom and online information retrieval; multimedia, hypertext and hypermedia; expert systems and natural language processing; user interface systems; internet, world wide web and digital library environments. illustrated with many examples and comprehensively referenced for an international audience, this is an ideal textbook for students of library and information studies and those professionals eager to advance their knowledge of the future of information. surrounding text:two main mechanisms for retrieving documents from these databases are in general use: boolean queries and informal ranked queries. a boolean querya set of query terms connected by the logical operators and, or, and notcan be used to identify the documents containing a given combination of terms, and is similar to the kind of query used on relational tables [***]<1>. ranking, on the other hand, is a process of matching an informal query to the documents and allocating scores to documents according to their degree of similarity to the query [29, ***]<1>. a boolean querya set of query terms connected by the logical operators and, or, and notcan be used to identify the documents containing a given combination of terms, and is similar to the kind of query used on relational tables [***]<1>. ranking, on the other hand, is a process of matching an informal query to the documents and allocating scores to documents according to their degree of similarity to the query [29, ***]<1>. a standard mechanism for supporting boolean queries is an inverted file [7, 11]<1> influence:2 type:3 pair index:89 citer id:43 citer title:self-indexing inverted files for fast text retrieval citer abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. 
this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness citee id:220 citee title:an efficient indexing technique for full-text database systems citee abstract:full-text database systems require an index to allow fast access to documents based on their content. we propose an inverted file indexing scheme based on compression. this scheme allows users to retrieve documents using words occurring in the documents, sequences of adjacent words, and statistical ranking techniques. the compression methods chosen ensure that the storage requirements are small and that dynamic update is straightforward. the only assumption that we make is that sufficient main memory is available to support an in-memory vocabulary; given this assumption, the method we describe requires at most one disc access per query term to identify answers to queries. surrounding text:ranking techniques can also be supported by inverted files. when the documents are stored in a database that is indexed by an inverted file several additional structures must be used if evaluation is to be fast [3, 18, ***]<1>. these include a weight for each word in the vocabulary. a table of mathematical symbols is provided at the end of the paper. 2 document databases in an inverted file document database, each distinct word in the database is held in a vocabulary [3, 11, 18, 24, 25, ***]<1>. the vocabulary entry for each word contains an address pointer to an inverted list (also known as a postings list ), a contiguous list of the documents containing the word influence:2 type:2 pair index:90 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:393 citee title:data compression in full-text retrieval systems citee abstract:describes compression methods for components of full-text systems such as text databases on cd-rom. 
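as a concrete, deliberately simplified illustration of the vocabulary-plus-inverted-lists organisation described above (an in-memory sketch only; the cited systems keep compressed lists on disk and only the vocabulary in memory), the following builds postings of (document number, within-document frequency) pairs:

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs: list of strings, document number = position in the list (from 1).
    returns {term: [(d, f_dt), ...]} with each list in document order,
    mirroring the vocabulary -> postings-list organisation described above."""
    index = defaultdict(list)
    for d, text in enumerate(docs, start=1):
        for term, f_dt in sorted(Counter(text.lower().split()).items()):
            index[term].append((d, f_dt))
    return index

docs = ["the cat sat", "the cat and the dog", "dog bites man"]
index = build_inverted_index(docs)
print(index["the"])   # [(1, 1), (2, 2)] -> document numbers with within-document frequencies
print(index["dog"])   # [(2, 1), (3, 1)]
```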
topics discussed include storage media; structures for full-text retrieval, including indexes, inverted files, and bitmaps; compression tools; memory requirements during retrieval; and ranking and information retrieval surrounding text:by sorting inverted lists by decreasing within-document frequency, so that they are frequency-sorted, the identifiers of the interesting documents are brought to the start of the list, also yielding a reduction in disk traffic because only part of each inverted list must be retrieved. frequency-sorting can potentially have an adverse impact on index size, because index compression techniques rely on the small di erences between adjacent documents in longer inverted lists to achieve size reductions [***, 10]<2>. we show, however, that it is possible to use frequencypersin, zobel, sacks-davis 3 sorting to achieve a net reduction in index size, regardless of whether the index is compressed influence:3 type:2 pair index:91 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:563 citee title:optimisation of inverted vector searches citee abstract:a review of the use of inverted files for best match searching in information retrieval systems surrounding text:the vocabulary contains each term t in the database and the number ft of documents containing t. knowledge of ft allows the terms in a query to be processed in order of decreasing weight [***, 8]<1>, as is necessary for the technique we shall describe. there is one inverted list for each t, consisting of the identifiers of the documents containing the term and, with each identifier d, the within-document frequency fd influence:2 type:3 pair index:92 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. 
we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:564 citee title:run-length encodings citee abstract:this article will discuss about how to implement a very simple compression algorithm rle. due to the fact that it's very easy to see, it can be a good introduction to programmers interested in data compression. rle is mainly used to compress runs of the same byte. however rle can also be good for the first stage of our bwt compressor, because we'll avoid too much time sorting strings that are equal due to the fact that there's a lot of runs with the same bytes in the data. surrounding text:4i : given that the number of documents containing a given term can be used to compute the average run-length, using a parameterised code the run-lengths can be efficiently compressed, as the run-lengths will conform to a known distribution with a known mean. for high-frequency terms, often only 1 or 2 bits are required to represent a run-length if coded using integer coding schemes such as those of elias [3]<3> or golomb [***]<3>. the fd influence:3 type:3 pair index:93 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:498 citee title:retrieving records from a gigabyte of text on a minicomputer using statistical ranking citee abstract:statistically based ranked retrieval of records using keywords provides many advantages over traditional boolean retrieval methods, especially for end users. this approach to retrieval, however, has not seen widespread use in large operational retrieval systems. to show the feasibility of this retrieval methodology, research was done to produce very fast search techniques using these ranking algorithms, and then to test the results against large databases with many end users. the results show not only response times on the order of 1 and l/2 seconds for 806 megabytes of text, but also very favorable user reaction. novice users were able to consistently obtain good search results after 5 minutes of training. additional work was done to devise new indexing techniques to create inverted files for large databases using a minicomputer. these techniques use no sorting, require a working space of only about 20% of the size of the input text, and produce indices that are about 14% of the input text size. surrounding text:processing of these values produces little increase in accuracy and is expensive, particularly in systems that use compression for inverted lists, since, to evaluate queries, large volumes of data have to be decompressed. 
there have been many attempts to improve the efficiency of ranked query evaluation [2, 4, ***, 8, 10]<2>. elimination of stop-wordsthat is, of very frequent words or closed-class words such as \and" and \of"is often used to reduce the number of uninformative terms processed. more sophisticated algorithms implement some dynamic stopping condition. the typical approach taken by these algorithms is to order terms in a query by decreasing weight, and then process terms in this order until some stopping condition is met [2, ***, 8, 10]<2>. mo at and zobel [10]<2> implemented the stopping condition by limiting the number of accumulators. the second version gave the same retrieval e ectiveness as a basic version that processed all inverted lists, and in conjunction with a modification to the index structure discussed below approximately halved processing time. harman and candela [***]<2> experimented with another pruning algorithm. they accumulated partial similarities given by all documents in all inverted lists (like the second algorithm by mo at and zobel) but limited the number of accumulators by setting a condition for the insertion of new documents into the set of relevant documents: their algorithm only considered those documents which contained terms with inverse document frequency more than a certain fraction of the maximum inverse document frequency of any term in the database. in other words, the thresholds provide a mechanism for tuning system load. thresholds have previously been used to decide whether to process or reject whole inverted lists [***]<2>, but not to decide whether to process or reject individual documents. the values of both thresholds for a term t are determined as a function of the accumulated partial similarity of the currently most relevant document smax. other term weighting systems the cosine measure as described in section 2 is not the only similarity measure. there are other similarity measures, for example those described by lucarella [8]<2> and harman and candela [***]<2>. we tested the robustness of document filtering by applying it to these similarity measures influence:1 type:2 pair index:94 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:325 citee title:coding for compression in full-text retrieval systems citee abstract:witten, bell and nevill (see ibid., p.23, 1991) have described compression models for use in full-text retrieval systems. the authors discuss other coding methods for use with the same models, and give results that show their scheme yielding virtually identical compression, and decoding more than forty times faster. 
one of the main features of their implementation is the complete absence of arithmetic coding; this, in part, is the reason for the high speed. the implementation is also particularly suited to slow devices such as cd-rom, in that the answering of a query requires one disk access for each term in the query and one disk access for each answer. all words and numbers are indexed, and there are no stop words. they have built two compressed databases surrounding text:in a database of n documents, the size of a document-sorted inverted list of p identifiers can be estimated as follows. the number of bits required to store the document identifiers is approximately [***]<3> bg(p) = p (1.5 + log2(n/p)). in addition an fd,t value must be stored for each document influence:3 type:3 pair index:95 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:43 citee title:self-indexing inverted files for fast text retrieval citee abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness surrounding text:by sorting inverted lists by decreasing within-document frequency, so that they are frequency-sorted, the identifiers of the interesting documents are brought to the start of the list, also yielding a reduction in disk traffic because only part of each inverted list must be retrieved. frequency-sorting can potentially have an adverse impact on index size, because index compression techniques rely on the small differences between adjacent documents in longer inverted lists to achieve size reductions [1, ***]<2>. we show, however, that it is possible to use frequency-sorting to achieve a net reduction in index size, regardless of whether the index is compressed.
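to make the size estimate quoted above concrete, here is a small sketch (hypothetical numbers, not figures from the cited paper) that evaluates bg(p) = p (1.5 + log2(n/p)):

```python
import math

def golomb_list_bits(p, n):
    """approximate bits to store p document identifiers drawn from n documents,
    using the estimate bg(p) = p * (1.5 + log2(n/p)) quoted above."""
    return p * (1.5 + math.log2(n / p))

# hypothetical example: a term appearing in 10,000 of 2,000,000 documents
bits = golomb_list_bits(10_000, 2_000_000)
print(f"{bits / 8 / 1024:.1f} KiB")   # roughly 11.2 KiB for the identifiers alone
```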
processing of these values produces little increase in accuracy and is expensive, particularly in systems that use compression for inverted lists, since, to evaluate queries, large volumes of data have to be decompressed. there have been many attempts to improve the efficiency of ranked query evaluation [2, 4, 6, 8, ***]<2>. elimination of stop-wordsthat is, of very frequent words or closed-class words such as \and" and \of"is often used to reduce the number of uninformative terms processed. more sophisticated algorithms implement some dynamic stopping condition. the typical approach taken by these algorithms is to order terms in a query by decreasing weight, and then process terms in this order until some stopping condition is met [2, 6, 8, ***]<2>. mo at and zobel [***]<2> implemented the stopping condition by limiting the number of accumulators. the typical approach taken by these algorithms is to order terms in a query by decreasing weight, and then process terms in this order until some stopping condition is met [2, 6, 8, ***]<2>. mo at and zobel [***]<2> implemented the stopping condition by limiting the number of accumulators. they tested two versions of the algorithm. interestingly, this phenomenon is consistent for di erent techniques and di erent document collections. for example, similar results were obtained by mo at and zobel in their experiments with an explicit limit on the number of accumulators [***]<2>, and in our own experiments with a di erent version of the cosine measure. the main saving yielded by this technique is a sharp reduction in the number of accumulators. the number of disk accesses is also reduced, since deciding whether to reject a term does not require a disk access. these results compare well to those of the \skipping" scheme of mo at and persin, zobel, sacks-davis 24 zobel [***]<2>, who on a larger database are only able to halve cpu time, and actually increase disk traffic slightly. however, their scheme is also applicable to boolean queries, for which they achieve much greater performance gains influence:1 type:2 pair index:96 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. 
we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:565 citee title:memory efficient ranking citee abstract:fast and effective ranking of a collection of documents with respect to a query requires several structures, including a vocabulary, inverted file entries, arrays of term weights and document lengths, an array of partial similarity accumulators, and address tables for inverted file entries and documentsfi of all of these structures, the array of document lengths and the array of accumulators are the components accessed most frequently in a ranked query, and it is crucial to acceptable surrounding text:these values are query independent and need to be computed only once, at database creation time. and can be e ectively compacted and stored in a few bits each [***]<3>. the reason they are stored separately is to allow e ective compression of the inverted file influence:3 type:3 pair index:97 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:158 citee title:a review of the use of inverted files for best match searching in information retrieval systems citee abstract:the use of inverted files for the calculation of similarity coefficients and other types of matching function is discussed in the context of mechanised document retrieval systems. a critical evaluation is presented of a range of algorithms which have been described for the matching of documents with queries. particular attention is paid to the computational efficiency of the various procedures, and improved search heuristics are given in some cases. it is suggested that the algorithms could be implemented sufficiently efficiently to permit the provision of nearest neighbour searching as a standard retrieval option. surrounding text:t and stored elsewhere. several term weighting systems have been proposed and explored [4, ***, 14]<1>. we assign the weight to a term in a query or a document using the frequencymodified inverse document frequency, described by wx influence:3 type:2 pair index:98 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. 
cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:566 citee title:introduction to modern information retrieval citee abstract:new technology now allows the design of sophisticated information retrieval system that can not only analyze, process and store, but can also retrieve specific resources matching a particular user��s needs. this clear and practical text relates the theory, techniques and tools critical to making information retrieval work. a completely revised second edition incorporates the latest developments in this rapidly expanding field, including multimedia information retrieval, user interfaces and digital libraries. chowdhury��s coverage is comprehensive, including classification, cataloging, subject indexing, ing, vocabulary control; cd-rom and online information retrieval; multimedia, hypertext and hypermedia; expert systems and natural language processing; user interface systems; internet, world wide web and digital library environments. illustrated with many examples and comprehensively referenced for an international audience, this is an ideal textbook for students of library and information studies and those professionals eager to advance their knowledge of the future of information. surrounding text:we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed. 1 introduction ranking is used to retrieve documents from a database and present them in order of estimated relevance to the user's query [13, ***]<1>. for the multi-gigabyte databases now available, ranking is considered the best option for data access: boolean queries require expert formulation, and techniques such as browsing are ine ective for the initial location of answers from among large numbers of documents. conclusions are presented in section 5. 2 ranked query evaluation the ranking technique we use to demonstrate our techniques is the cosine measure [13, ***]<1>. for this measure, the similarity of document d and query q is for practical purposes computed by cq. t and stored elsewhere. several term weighting systems have been proposed and explored [4, 12, ***]<1>. we assign the weight to a term in a query or a document using the frequencymodified inverse document frequency, described by wx. it is supposed that rare terms have high discrimination value and the presence of such a term in both a document and a query is a good indication that the document is relevant to the query. database structure we use inverted files to index documents [13, ***, 15]<1>. an inverted index for a document database typically has two components: a vocabulary and a set of inverted lists. overall, such inverted index compression techniques can reduce index size by a factor of six or more [1, 10]<1>. for a large document database indexed by an inverted file, the index can be used to simultaneously compute the cosine correlation between each document in a collection and the query as follows [4, 10, 13, ***]<1>. 
an accumulator is created for each document, either by initially allocating an accumulator for every document in the database or by dynamically adding an accumulator for a document when it is allocated non-zero similarity influence:2 type:3 pair index:99 citer id:44 citer title:filtered document retrieval with frequency-sorted indexes citer abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed citee id:220 citee title:an efficient indexing technique for full-text database systems citee abstract:full-text database systems require an index to allow fast access to documents based on their content. we propose an inverted file indexing scheme based on compression. this scheme allows users to retrieve documents using words occurring in the documents, sequences of adjacent words, and statistical ranking techniques. the compression methods chosen ensure that the storage requirements are small and that dynamic update is straightforward. the only assumption that we make is that sufficient main memory is available to support an in-memory vocabulary; given this assumption, the method we describe requires at most one disc access per query term to identify answers to queries. surrounding text:it is supposed that rare terms have high discrimination value and the presence of such a term in both a document and a query is a good indication that the document is relevant to the query. database structure we use inverted files to index documents [13, 14, ***]<1>. an inverted index for a document database typically has two components: a vocabulary and a set of inverted lists influence:2 type:1 pair index:100 citer id:47 citer title:compression of inverted indexes for fast query evaluation citer abstract:compression reduces both the size of indexes and the time needed to evaluate queries. in this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. first, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. second, we explore the impact of choice of compression scheme on retrieval efficiency. 
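the accumulator-based evaluation outlined above can be sketched as follows; the particular query-weight formula used here is an assumption for illustration, not the exact weighting of any cited system, and accumulators are added dynamically rather than preallocated:

```python
import math
from collections import defaultdict

def cosine_rank(query, index, doc_norms, n_docs, top_k=10):
    """query: {term: f_qt}; index: {term: [(d, f_dt), ...]};
    doc_norms: {d: length of the document weight vector}.
    an accumulator is created for a document the first time it receives
    a non-zero partial similarity, as described above."""
    acc = defaultdict(float)
    for t, f_qt in query.items():
        postings = index.get(t, [])
        if not postings:
            continue
        w_qt = f_qt * math.log(1 + n_docs / len(postings))   # idf-style query weight (assumed form)
        for d, f_dt in postings:
            acc[d] += w_qt * f_dt
    ranked = [(acc[d] / doc_norms[d], d) for d in acc]        # cosine-style length normalisation
    return sorted(ranked, reverse=True)[:top_k]

# tiny hypothetical index and document norms
index = {"cat": [(1, 2), (3, 1)], "dog": [(2, 1), (3, 2)]}
doc_norms = {1: 2.0, 2: 1.0, 3: 2.2}
print(cosine_rank({"cat": 1, "dog": 1}, index, doc_norms, n_docs=3, top_k=2))
```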
in experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact golomb-rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the cpu cache is less for an appropriately compressed index than for an uncompressed index. moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. we conclude that fast byte-aligned codes should be used to store integers in inverted lists citee id:43 citee title:self-indexing inverted files for fast text retrieval citee abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness surrounding text:in this paper, we revisit compression schemes for the inverted list component of inverted indexes. there have been a great many reports of experiments on compression of indexes with bitwise compression schemes [***, 7, 12, 14, 15]<2>, which use an integral number of bits to represent each integer, usually with no restriction on the alignment of the integers to byte or machine-word boundaries. we consider several 1see http://www. t of the term t in the document d, and other factors. fourth, after processing part [1, ***]<2> or all of the lists, the accumulator scores are partially sorted to identify the most similar documents. last, for a typical search engine, document summaries of the top ten documents are generated or retrieved and shown to the user influence:1 type:2 pair index:101 citer id:47 citer title:compression of inverted indexes for fast query evaluation citer abstract:compression reduces both the size of indexes and the time needed to evaluate queries. in this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. first, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. second, we explore the impact of choice of compression scheme on retrieval efficiency. 
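the byte-aligned coding discussed in this record is commonly realised as a variable-byte code, with seven data bits per byte and a flag bit marking the final byte of each integer; below is a minimal sketch of one such convention (an illustration, not necessarily the exact code evaluated in the cited paper):

```python
def vbyte_encode(n):
    """variable-byte code: 7 data bits per byte, low-order bits first,
    high bit set on the final byte of each integer."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)                 # terminator byte carries the flag bit
    return bytes(out)

def vbyte_decode(data):
    """decode a stream of vbyte-coded integers (e.g. a list of d-gaps)."""
    values, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:                  # final byte of this integer
            values.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return values

gaps = [1, 5, 130, 20000]
encoded = b"".join(vbyte_encode(g) for g in gaps)
assert vbyte_decode(encoded) == gaps
```

because decoding works a whole byte at a time, no bit-level shifting across machine words is needed, which is the main reason such codes decode faster than golomb-rice coding even though they use a little more space.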
in experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact golomb-rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the cpu cache is less for an appropriately compressed index than for an uncompressed index. moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. we conclude that fast byte-aligned codes should be used to store integers in inverted lists citee id:44 citee title:filtered document retrieval with frequency-sorted indexes citee abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed surrounding text:other arrangements of the postings in lists are useful when lists are not necessarily completely processed in response to a query. for example, in frequency-sorted indexes [8, ***]<2> postings are ordered by fd. t, and in impact-ordered indexes the postings are ordered by quantised weights [1]<2> influence:3 type:2 pair index:102 citer id:47 citer title:compression of inverted indexes for fast query evaluation citer abstract:compression reduces both the size of indexes and the time needed to evaluate queries. in this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. first, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. second, we explore the impact of choice of compression scheme on retrieval efficiency. in experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact golomb-rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the cpu cache is less for an appropriately compressed index than for an uncompressed index. moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. 
we conclude that fast byte-aligned codes should be used to store integers in inverted lists citee id:342 citee title:searching the web: the public and their queries citee abstract:in studying actual web searching by the public at large, we analyzed over one million web queries by users of the excite search enginefi we found that most people use few search terms, few modified queries, view few web pages, and rarely use advanced search featuresfi a small number of search terms are used with high frequency, and a great many terms are unique; the language of web queries is distinctivefi queries about recreation and en-tertainment rank highestfi findings are compared to data from two other large studies of web queriesfi this study provides an insight into the public practices and choices in web searchingfi surrounding text:a 500 mb collection is used, and results are averaged over 10,000 queries. schemes with streams of 10,000 or 25,000 queries extracted from a query log [***]<3>, where the frequency distribution of query terms leads to beneficial use of caching. the other level at which caching takes place is the retention in the cpu cache of small blocks of data, typically of 128 bytes, recently accessed from memory. small collection figure 2 shows the relative performance of the integer compression schemes we have described for storing o sets, on a 500 mb collection of 94,802 web documents drawn from the trec web track data [4]<3>. timing results are averaged over 10,000 ranked queries drawn from an excite search engine query log [***]<3>. the index contains 703. exactly the same index types are tried as for the experiment above. a 20 gb collection of 4,014,894 web documents drawn from the trec web track data [4]<3> is used and timing results are averaged over 25,000 ranked queries drawn from an excite search engine query log [***]<3>. the index contains 9,574,703 terms influence:2 type:3 pair index:103 citer id:47 citer title:compression of inverted indexes for fast query evaluation citer abstract:compression reduces both the size of indexes and the time needed to evaluate queries. in this paper, we revisit the compression of inverted lists of document postings that store the position and frequency of indexed terms, considering two approaches to improving retrieval efficiency: better implementation and better choice of integer compression schemes. first, we propose several simple optimisations to well-known integer compression schemes, and show experimentally that these lead to significant reductions in time. second, we explore the impact of choice of compression scheme on retrieval efficiency. in experiments on large collections of data, we show two surprising results: use of simple byte-aligned codes halves the query evaluation time compared to the most compact golomb-rice bitwise compression schemes; and, even when an index fits entirely in memory, byte-aligned codes result in faster query evaluation than does an uncompressed index, emphasising that the cost of transferring data from memory to the cpu cache is less for an appropriately compressed index than for an uncompressed index. moreover, byte-aligned schemes have only a modest space overhead: the most compact schemes result in indexes that are around 10% of the size of the collection, while a byte-aligned scheme is around 13%. 
we conclude that fast byte-aligned codes should be used to store integers in inverted lists citee id:343 citee title:managing gigabytes: compressing and indexing documents and images citee abstract:in this fully updated second edition of the highly acclaimed managing gigabytes, authors witten, moffat, and bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. it covers the latest developments in compression and indexing and their application on the web and in digital libraries. it also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the web. surrounding text:moreover, the increasing availability and affordability of large storage devices suggests that the amount of data stored online will continue to grow. inverted indexes are used to evaluate queries in all practical search engines [***]<1>. compression of these indexes has three major benefits for performance. 2. inverted indexes an inverted index consists of two major components: the vocabulary of terms (for example, the words) from the collection, and inverted lists, which are vectors that contain information about the occurrence of the terms [***]<1>. in a basic implementation, for each term t there is an inverted list that contains postings < fd. 3. compressing inverted indexes special-purpose integer compression schemes offer both fast decoding and compact storage of inverted lists [13, ***]<1>. in this section, we consider how inverted lists are compressed and stored on disk. for compression of inverted lists, a value of b is required. witten et al. [***]<1> report that for cases where the probability of any particular integer value occurring is small (which is the usual case for document numbers d and offsets o) then b can be calculated as b = 0.69 x mean(k). for each inverted list, the mean value of document numbers d can be approximated as k = n/ft where n is the number of documents in the collection and ft is the number of postings in the inverted list for term t [***]<1>. this approach can also be extended to offsets: the mean value of offsets o for an inverted list posting can be approximated as k = ld/fd,t, and a sequence of offsets o. a standard choice is to use golomb codes for document numbers, gamma codes for frequencies, and delta codes for offsets [***]<1>. (we explore the properties of this choice later.) a third optimisation is an approach we call signature blocks, which are a variant of skipping.
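a small sketch of the parameter choice just described (hypothetical counts, not taken from the cited experiments): b is set to 0.69 times the mean gap, with the mean approximated as n/ft for document gaps and ld/fd,t for offsets.

```python
def golomb_parameter(mean_gap):
    """b = 0.69 * mean(k), as quoted above; rounded to the nearest integer here
    (the exact rounding rule is an assumption) and kept at least 1."""
    return max(1, round(0.69 * mean_gap))

# hypothetical term: appears in f_t = 50,000 of n = 4,000,000 documents,
# and f_dt = 3 times in a document with l_d = 900 indexed positions.
b_docs = golomb_parameter(4_000_000 / 50_000)   # mean document gap n / f_t = 80  -> b = 55
b_offs = golomb_parameter(900 / 3)              # mean offset gap  l_d / f_dt = 300 -> b = 207
print(b_docs, b_offs)
```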
skipping is the approach of storing additional integers in inverted lists that indicate how much data can be skipped without any processing [***]<1>. skipping has the disadvantage of an additional storage space requirement, but has been shown to offer substantial speed improvements [***]<1>. skipping is the approach of storing additional integers in inverted lists that indicate how much data can be skipped without any processing [***]<1>. skipping has the disadvantage of an additional storage space requirement, but has been shown to offer substantial speed improvements [***]<1>. a signature block is an eight-bit block that stores the flag bits of up to eight blocks that follow influence:2 type:1 pair index:104 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:53 citee title:a fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems citee abstract:unstructured meshes are used in many large-scale scientific and engineering problems, including finite-volume methods for computational fluid dynamics and finite-element methods for structural analysis. if unstructured problems such as these are to be solved on distributed-memory parallel computers, their data structures must be partitioned and distributed across processors; if they are to be solved efficiently, the partitioning must maximize load balance and minimize interprocessor... surrounding text:however, these methods are very expensive since they require the computation of the eigenvector corresponding to the second smallest eigenvalue (fiedler vector). execution time of the spectral methods can be reduced if computation of the fiedler vector is done by using a multilevel algorithm [***]<2>. this multilevel spectral bisection (msb) algorithm usually manages to speed up the spectral partitioning methods by an order of magnitude without any loss in the quality of the edge-cut. comparison with other partitioning schemes.
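returning to the skipping idea described at the start of this record, here is a small synthetic sketch (my illustration, not the cited on-disk layout): each block of postings is summarised by its last document number, so whole blocks can be bypassed without decoding the postings they contain.

```python
from bisect import bisect_left

def add_skips(doc_ids, block=4):
    """group a sorted list of document numbers into blocks and record, for each
    block, its last document number (the skip entry)."""
    blocks = [doc_ids[i:i + block] for i in range(0, len(doc_ids), block)]
    skips = [b[-1] for b in blocks]
    return skips, blocks

def contains(skips, blocks, d):
    """membership test that inspects only one candidate block, via the skip entries."""
    i = bisect_left(skips, d)
    return i < len(blocks) and d in blocks[i]

skips, blocks = add_skips([2, 9, 17, 40, 56, 73, 88, 120, 140], block=4)
print(contains(skips, blocks, 73), contains(skips, blocks, 74))   # True False
```

in a compressed list the bypassed blocks are simply never decoded, which is where the speed improvement comes from; the skip entries themselves are the extra storage cost mentioned above.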
the msb [***]<2> has been shown to be an e ective method for partitioning unstructured problems in a variety of applications. the msb algorithm coarsens the graph down to a few hundred vertices using random matching. in the absence of extensive data, we could not have done any better anyway. in table 9 we show three di erent variations of spectral partitioning [45, 47, 26, ***]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [23]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [37, 36, 15]<2>. for each graph partitioning algorithm, table 9 shows a number of characteristics influence:2 type:1 pair index:105 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the e ectiveness of many di erent choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan{lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:54 citee title:an algorithm for partitioning the nodes of a graph citee abstract:let $g = \{ n,e \}$ be an undirected graph having nodes $n$ and edges $e$. we consider the problem of partitioning $n$ into $k$ disjoint subsets $n_1 , \cdots ,n_k $ of given sizes $m_1 , \cdots ,m_k $, respectively, in such a way that the number of edges in $e$ that connect different subsets is minimal. we obtain a heuristic solution from the solution of a linear programming transportation problem. surrounding text:since during coarsening the weights of the vertices and edges of the coarser graph were set to re��ect the weights of the vertices and edges of the finer graph, gm contains sufficient information to intelligently enforce the balanced partition and the small edge-cut requirements. a partition of gm can be obtained using various algorithms such as (a) spectral bisection [45, 47, 2, 24]<1>, (b) geometric bisection [37, 36]<1> (if coordinates are available),2 and (c) combinatorial methods [31, ***, 11, 12, 17, 5, 33, 21]<1>. 
since the size of the coarser graph gm is small (i influence:1 type:2 pair index:106 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:55 citee title:graph bisection algorithms with good average case behavior citee abstract:in the paper, we describe a polynomial time algorithm that, for every input graph, either outputs the minimum bisection of the graph or halts without output. more importantly, we show that the algorithm chooses the former course with high probability for many natural classes of graphs. in particular, for every fixed d >= 3, all sufficiently large n and all b = o(n^(1-1/...)), the algorithm finds the minimum bisection for almost all d-regular labelled simple graphs with 2n nodes and bisection width b. for example, the algorithm succeeds for almost all 5-regular graphs with 2n nodes and bisection width o(n^(2/3)). the algorithm differs from other graph bisection heuristics (as well as from many heuristics for other np-complete problems) in several respects. surrounding text:since during coarsening the weights of the vertices and edges of the coarser graph were set to reflect the weights of the vertices and edges of the finer graph, gm contains sufficient information to intelligently enforce the balanced partition and the small edge-cut requirements. a partition of gm can be obtained using various algorithms such as (a) spectral bisection [45, 47, 2, 24]<1>, (b) geometric bisection [37, 36]<1> (if coordinates are available), and (c) combinatorial methods [31, 3, 11, 12, 17, ***, 33, 21]<1>.
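the coarsening phase described in this record pairs vertices via a matching; the following is a minimal sketch of the heavy-edge idea as it is usually described (my illustration, not the authors' code): visit vertices in random order and match each with the still-unmatched neighbour joined by the heaviest edge, then collapse matched pairs into single vertices of the coarser graph.

```python
import random

def heavy_edge_matching(adj):
    """adj: {u: {v: edge_weight}}. returns a dict mapping each vertex to its mate
    (or to itself if it stays unmatched), following the heavy-edge heuristic sketch."""
    match = {}
    order = list(adj)
    random.shuffle(order)                          # random visiting order
    for u in order:
        if u in match:
            continue
        # heaviest edge to a still-unmatched neighbour, if any
        candidates = [(w, v) for v, w in adj[u].items() if v not in match]
        if candidates:
            _, v = max(candidates)
            match[u], match[v] = v, u
        else:
            match[u] = u                           # unmatched; copied to the coarser graph as-is
    return match

# toy graph: the weights favour collapsing the 0-1 and 2-3 pairs
adj = {0: {1: 5, 2: 1}, 1: {0: 5, 3: 1}, 2: {0: 1, 3: 4}, 3: {1: 1, 2: 4}}
print(heavy_edge_matching(adj))
```

collapsing the heaviest edges keeps as much edge weight as possible inside the coarse vertices, which is why the partition of the coarse graph stays close in cut size to the final refined partition.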
since the size of the coarser graph gm is small (i influence:2 type:2 pair index:107 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:56 citee title:geometric spectral partitioning citee abstract:we investigate a new method for partitioning a graph into two equal-sized pieces with few connecting edges. we combine ideas from two recently suggested partitioning algorithms, spectral bisection (which uses an eigenvector of a matrix associated with the graph) and geometric bisection (which applies to graphs that are meshes in euclidean space). the new method does not require geometric coordinates, and it produces partitions that are often better than either the spectral or geometric ones. surrounding text:, linear programming, vlsi), there is no geometry associated with the graph. recently, an algorithm has been proposed to compute coordinates for graph vertices [***]<2> by using spectral methods. but these methods are much more expensive and dominate the overall time taken by the graph partitioning algorithm influence:1 type:2 pair index:108 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement.
in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:57 citee title:parallel algorithms for dynamically partitioning unstructured grids citee abstract:grid partitioning is the method of choice for decomposing a wide variety of computational problems into naturally parallel pieces. in problems where computational load on the grid or the grid itself changes as the simulation progresses, the ability to repartition dynamically and in parallel is attractive for achieving higher performance. we describe three algorithms suitable for parallel dynamic load-balancing which attempt to partition unstructured grids so that computational load is surrounding text:schemes that rely on coordinate information do not seem to have this limitation, and in principle it appears that these schemes can be parallelized quite effectively. however, all available parallel formulations of these schemes [23, ***]<2> obtain no better speedup than that obtained for the multilevel scheme in [30]<2>. influence:2 type:2 pair index:109 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time.
also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:58 citee title:a linear time heuristic for improving network partitions citee abstract:the fiduccia-mattheyses min-cut heuristic provides an efficient solution to the problem of separating a network of vertices into 2 separate partitions in an effort to minimize the number of nets which contain nodes in each partition. the heuristic is designed to specifically handle large and complex networks which contain multi-terminal and also weighted cells. the worst case computation time of this heuristic increases linearly with the overall size of the network. additionally, in practice it can be seen that the number of iterations typically required for the cut-set to converge to the minimum value is very small. the key factor in obtaining this linear-time behavior is that the algorithm moves one node at a time between partitions in an attempt to reduce the current cut-set by the maximum possible value. we must also note that at times, the maximum cut-set reduction will be negative; however, at this point we proceed with the algorithm as this allows the opportunity to escape from a local minimum, should we currently be in one. upon the movement of a node, appropriate nodes connected to the moved cell are updated to represent the move. the use of simple yet efficient data structures allows us to avoid redundant searching for the best cell to be moved, and similarly also prevents excess and unnecessary updates to the neighbor cells that need to be updated. thus the data structures themselves assist in adding to the already efficient nature of the algorithm. the heuristic also contains a balance constraint which allows the user to retain control over the size of the created partitions. surrounding text:developed. one such algorithm is by fiduccia and mattheyses [***]<2> that reduces the complexity to o(|e|) by using appropriate data structures. the kl algorithm finds locally optimal partitions when it starts with a good initial partition and when the average degree of the graph is large [4]<3>. after moving v, v is marked so it will not be considered again in the same iteration, and the gains of the vertices adjacent to v are updated to reflect the change in the partition. the original kl algorithm [***]<2> continues moving vertices between the partitions until all the vertices have been moved. however, in our implementation, the kl algorithm terminates when the edge-cut does not decrease after x vertex moves. the trial with the smaller edge-cut is selected as the partition. this partition is then further refined by using it as the input to the kl. the fm algorithm [***]<2> is slightly different than that originally developed by kernighan and lin [31]<2>.
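the gain-driven, one-vertex-at-a-time refinement described in the surrounding text above can be summarized by the following simplified python sketch (same adjacency-dictionary representation as the earlier sketches). it recomputes gains naively instead of using the bucket data structures that give fm its o(|e|) bound, it ignores the balance constraint, and the function name and parameters are illustrative only.

    def fm_pass(adj, part, max_bad_moves=50):
        # one fm-style refinement pass: repeatedly move the highest-gain unlocked vertex,
        # then roll back the trailing moves that did not improve the edge-cut
        def gain(v):
            ext = sum(w for u, w in adj[v].items() if part[u] != part[v])
            internal = sum(w for u, w in adj[v].items() if part[u] == part[v])
            return ext - internal                  # positive gain => moving v reduces the cut

        cut = sum(w for v in adj for u, w in adj[v].items() if v < u and part[v] != part[u])
        best_cut, moves, best_len, locked = cut, [], 0, set()
        while len(locked) < len(adj) and len(moves) - best_len < max_bad_moves:
            v = max((x for x in adj if x not in locked), key=gain)
            cut -= gain(v)
            part[v] = 1 - part[v]                  # move v to the other part and lock it
            locked.add(v)
            moves.append(v)
            if cut < best_cut:
                best_cut, best_len = cut, len(moves)
        for v in moves[best_len:]:                 # undo the non-improving tail of moves
            part[v] = 1 - part[v]
        return part, best_cut

in practical implementations the repeated gain computation is replaced by bucket lists keyed on gain, which is what gives the pass the linear-time behavior claimed in the abstract above.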
the difference is that in each step, the fm algorithm moves a single vertex from one part to the other whereas the kl algorithm selects a pair of vertices, one from each part, and moves them influence:2 type:1 pair index:110 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:59 citee title:finding clusters in vlsi circuits citee abstract:circuit partitioning plays a fundamental role in hierarchical layout systems. identifying the strongly connected subcircuits, the clusters, of the logic can significantly reduce the delay of the circuit and the total interconnection length. finding such a cluster partition, however, is np-complete. the authors propose a fast heuristic algorithm based on a simple, local criterion. they are able to prove that for highly structured circuits the clusters found by this algorithm correspond with high probability to the `natural' clusters. an application to large scale real world circuits shows that by this method the number of nets cut is reduced by up to 46% compared to the standard mincut approach surrounding text:another class of graph partitioning algorithms reduces the size of the graph (i.e., coarsens the graph) by collapsing vertices and edges, partitions the smaller graph, and then uncoarsens it to construct a partition for the original graph. these are called multilevel graph partitioning schemes [4, 7, 19, 20, 26, ***, 43]<2>. some researchers investigated multilevel schemes primarily to decrease the partitioning time, at the cost of somewhat worse partition quality [43]<2>. recently, a number of multilevel algorithms have been proposed [4, 26, 7, 20, ***]<2> that further refine the partition during the uncoarsening phase.
these schemes tend to give good partitions at a reasonable cost influence:2 type:2 pair index:111 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the e ectiveness of many di erent choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan{lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:60 citee title:the evolution of the minimum degree ordering algorithm citee abstract:over the past fifteen years, the implementation of the minimum degree algorithm has received much study, and many important enhancements have been made to it. in this article, we describe these various enhancements, trace their historical development, and provide some experiments showing how very effective they are in improving the execution time of the algorithm. we also present a shortcoming that exists in all of the widely used implementations of the algorithm, namely, that the quality of the ordering provided by the implementations is surprisingly sensitive to the initial ordering. for example. changing the input ordering can lead to an increase (or decrease) of as much as a factor of three in the cost of the subsequent numerical factorization. this sensitivity is caused by the lack of an effective tie-breaking strategy. and our experiments illustrate the importance of developing such a strategy. surrounding text:on a parallel computer, a fill-reducing ordering, besides minimizing the operation count, should also increase the degree of concurrency that can be exploited during factorization. in general, nested dissectionbased orderings exhibit more concurrency during factorization than minimum degree orderings [***, 35]<2>. the minimum degree [***]<2> ordering heuristic is the most widely used fill-reducing algorithm that is used to order sparse matrices for factorization on serial computers. in general, nested dissectionbased orderings exhibit more concurrency during factorization than minimum degree orderings [***, 35]<2>. the minimum degree [***]<2> ordering heuristic is the most widely used fill-reducing algorithm that is used to order sparse matrices for factorization on serial computers. 
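the record above concerns the minimum degree heuristic for fill-reducing orderings. the python sketch below shows only the basic, unoptimized idea (eliminate a vertex of smallest current degree and join its remaining neighbors into a clique); it omits the tie-breaking, mass elimination, and multiple-elimination refinements that the multiple minimum degree (mmd) variant relies on, and the function name is illustrative.

    def minimum_degree_ordering(adj):
        # plain minimum-degree ordering: repeatedly eliminate a vertex of smallest current
        # degree and add the fill edges that make its remaining neighbors a clique
        g = {v: set(nbrs) for v, nbrs in adj.items()}     # work on a copy; edge weights ignored
        order = []
        while g:
            v = min(g, key=lambda x: len(g[x]))           # vertex of minimum current degree
            nbrs = g.pop(v)
            order.append(v)
            for u in nbrs:
                g[u].discard(v)
                g[u].update(nbrs - {u})                   # fill edges among v's neighbors
        return order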
the minimum degree algorithm has been found to produce very good orderings influence:1 type:1 pair index:112 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the e ectiveness of many di erent choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan{lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:61 citee title:geometric mesh partitioning: implementation and experiments citee abstract:we investigate a method of dividing an irregular mesh into equal-sized pieces with few interconnecting edgesfi the method"s novel feature is that it exploits the geometric coordinates of the mesh verticesfi it is based on theoretical work of miller, teng, thurston, and vavasis, who showed that certain classes of "well-shaped" finite element meshes have good separatorsfi the geometric method is quite simple to implement: we describe a matlab code for it in some detailfi the method is also quite surrounding text:however, due to the randomized nature of these algorithms, multiple trials are often required (5 to 50) to obtain solutions that are comparable in quality with spectral methods. multiple trials do increase the time [***]<2>, but the overall runtime is still substantially lower than the time required by the spectral methods. geometric graph partitioning algorithms are applicable only if coordinates are available for the vertices of the graph. in the absence of extensive data, we could not have done any better anyway. in table 9 we show three di erent variations of spectral partitioning [45, 47, 26, 2]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [23]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [37, 36, ***]<2>. for each graph partitioning algorithm, table 9 shows a number of characteristics. 5 the quality of the other schemes is worse than the above three by various degrees. 
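the geometric schemes discussed in the record above exploit vertex coordinates rather than graph connectivity. as a toy illustration of that idea (not the algorithm of the cited papers), the following python sketch bisects a point set at the median of the axis with the largest spread, in the spirit of coordinate nested dissection.

    import numpy as np

    def coordinate_bisection(coords):
        # split the vertices into two equal halves at the median of the longest axis
        coords = np.asarray(coords, dtype=float)
        axis = int(np.argmax(coords.max(axis=0) - coords.min(axis=0)))
        ranks = np.argsort(coords[:, axis])               # vertices ordered along that axis
        part = np.zeros(len(coords), dtype=int)
        part[ranks[len(coords) // 2:]] = 1                # upper half of the split goes to part 1
        return part

because such a split ignores the edges entirely, multiple randomized trials (as noted in the surrounding text) are typically needed to reach the quality of connectivity-based methods.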
note that for both kl (footnote 5: this conclusion is an extrapolation of the results presented in [***]<2>, where it was shown that geometric partitioning with 30 trials (default geometric) produces partitions comparable with those of msb without kl refinement.) table 9: characteristics of various graph partitioning algorithms influence:1 type:2 pair index:113 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:62 citee title:heuristic algorithms for automatic graph partitioning citee abstract:practical implementations of the finite element method on distributed memory multi-computer systems necessitate the use of partitioning tools to subdivide the mesh into sub-meshes of roughly equal size. graph partitioning algorithms are mandatory when implementing distributed sparse matrix methods or domain decomposition techniques for irregularly structured problems on parallel computers. we propose a class of algorithms which are based on level set expansions from a number of center nodes. a ... surrounding text:even though multilevel algorithms are quite fast compared with spectral methods, they can still be the bottleneck if the sparse system of equations is being solved in parallel [32, 18]<2>. the coarsening phase of these methods is relatively easy to parallelize [30]<2>, but the kl heuristic used in the refinement phase is very difficult to parallelize [***]<2>. since both the coarsening phase and the refinement phase with the kl heuristic take roughly the same amount of time, the overall runtime of the multilevel scheme of [26]<2> cannot be reduced significantly. schemes that require multiple trials are inherently parallel, as different trials can be done on different processors. in contrast, a single trial of kl is very difficult to parallelize [***]<2> and appears inherently serial in nature.
multilevel schemes that do not rely upon kl [30]<2> and the spectral bisection scheme are moderately parallel in nature influence:2 type:2 pair index:114 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the e ectiveness of many di erent choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan{lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:63 citee title:highly scalable parallel algorithms for sparse matrix factorization citee abstract:in this paper, we describe scalable parallel algorithms for sparse matrix factorization, analyze their performance and scalability, and present experimental results for up to 1024 processors on a cray t3d parallel computer. through our analysis and experimental results, we demonstrate that our algorithms substantially improve the state of the art in parallel direct solution of sparse linear systems - both in terms of scalability and overall performance. it is a well known fact that dense matrix factorization scales well and can be implemented efficiently on parallel computers. in this paper, we present the first algorithms to factor a wide class of sparse matrices (including those arising from two- and three-dimensional finite element problems) that are asymptotically as scalable as dense matrix factorization algorithms on a variety of parallel architectures. our algorithms incur less communication overhead and are more scalable than any previously known parallel formulation of sparse matrix factorization. although, in this paper, we discuss cholesky factorization of symmetric positive definite matrices, the algorithms can be adapted for solving sparse linear least squares problems and for gaussian elimination of diagonally dominant matrices that are almost symmetric in structure. an implementation of one of our sparse cholesky factorization algorithms delivers up to 20 gflops on a cray t3d for medium-size structural engineering and linear programming problems. to the best of our knowledge, this is the highest performance ever obtained for sparse cholesky factorization on any supercomputer. 
surrounding text:however, even msb can take a large amount of time. in particular, in parallel direct solvers, the time for computing ordering using msb can be several orders of magnitude higher than the time taken by the parallel factorization algorithm, and thus ordering time can dominate the overall time to solve the problem [***]<2>. another class of graph partitioning techniques uses the geometric information of the graph to find a good partition. surprisingly, our scheme substantially outperforms the multiple minimum degree algorithm [35]<2>, which is the most commonly used method for computing fill-reducing orderings of a sparse matrix. even though multilevel algorithms are quite fast compared with spectral methods, they can still be the bottleneck if the sparse system of equations is being solved in parallel [32, ***]<2>. the coarsening phase of these methods is relatively easy to parallelize [30]<2>, but the kl heuristic used in the refinement phase is very difficult to parallelize [16]<2>. bars under the baseline indicate that mlnd performs better than mmd. elimination trees produced by mmd (a) exhibit little concurrency (long and slender) and (b) are unbalanced so that subtree-to-subcube mappings lead to significant load imbalances [32, 12, ***]<2>. on the other hand, orderings based on nested dissection produce orderings that have both more concurrency and better balance [27, 22]<2> influence:2 type:2 pair index:115 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the e ectiveness of many di erent choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan{lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:64 citee title:fast spectral methods for ratio cut partitioning and clustering citee abstract:the ratio cut partitioning objective function successfully embodies both the traditional min-cut and equipartition goals of partitioning. fiduccia-mattheyses style ratio cut heuristics have achieved cost savings averaging over 39% for circuit partitioning and over 50% for hardware simulation applications. 
the authors show a theoretical correspondence between the optimal ratio cut partition cost and the second smallest eigenvalue of a particular netlist-derived matrix, and present fast lanczos-based methods for computing heuristic ratio cuts from the eigenvector of this second eigenvalue. results are better than those of previous methods, e.g. by an average of 17% for the primary mcnc benchmarks. an efficient clustering method, also based on the second eigenvector, is very successful on the `difficult' input classes in the cad (computer-aided design) literature. extensions and directions for future work are also considered surrounding text:another class of graph partitioning algorithms reduces the size of the graph (i.e., coarsens the graph) by collapsing vertices and edges, partitions the smaller graph, and then uncoarsens it to construct a partition for the original graph. these are called multilevel graph partitioning schemes [4, 7, ***, 20, 26, 10, 43]<2>. some researchers investigated multilevel schemes primarily to decrease the partitioning time, at the cost of somewhat worse partition quality [43]<2> influence:2 type:2 pair index:116 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:65 citee title:a new approach to effective circuit clustering citee abstract:the complexity of next-generation vlsi systems will exceed the capabilities of top-down layout synthesis algorithms, particularly in netlist partitioning and module placement. bottom-up clustering is needed to "condense" the netlist so that the problem size becomes tractable to existing optimization methods. in this paper, we establish the ds quality measure, the first general metric for evaluation of clustering algorithms. the ds metric in turn motivates our rw-st algorithm, a new self-tuning clustering method based on random walks in the circuit netlist. rw-st efficiently captures a globally good circuit clustering.
when incorporated within a two-phase iterative fiduccia-mattheyses partitioning strategy, the rw-st clustering method improves bisection width by an average of 17% over previous matching-based methods. surrounding text:another class of graph partitioning algorithms reduces the size of the graph (i.e., coarsens the graph) by collapsing vertices and edges, partitions the smaller graph, and then uncoarsens it to construct a partition for the original graph. these are called multilevel graph partitioning schemes [4, 7, 19, ***, 26, 10, 43]<2>. some researchers investigated multilevel schemes primarily to decrease the partitioning time, at the cost of somewhat worse partition quality [43]<2>. recently, a number of multilevel algorithms have been proposed [4, 26, 7, ***, 10]<2> that further refine the partition during the uncoarsening phase. these schemes tend to give good partitions at a reasonable cost influence:2 type:2 pair index:117 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:66 citee title:parallel algorithms for matrix computations citee abstract:it is shown that a mesh-connected n × n multiprocessor system can compute the inverse of an n × n matrix in time linear in n. the algorithm is based on a theorem known to sylvester in 1851. it computes the cofactor matrix in n steps, each of which involves 4 unit-distance message routings and 4 arithmetic operations for every processor. the coding and memory requirement for each processor is the same and is independent of n. it is also shown that the same algorithm solves systems of n linear equations in time linear in n with n × (n+1) processors surrounding text:thus, in order to minimize the communication overhead, we need to obtain a p-way partition of ga and then to distribute the rows of a according to this partition.
another important application of recursive bisection is to find a fill-reducing ordering for sparse matrix factorization [12, 32, ***]<3>. these algorithms are generally referred to as nested dissection ordering algorithms. thus, the problem of performing a k-way partition can be solved by performing a sequence of 2-way partitions or bisections. even though this scheme does not necessarily lead to an optimal partition, it is used extensively due to its simplicity [12, ***]<3>. influence:2 type:2 pair index:118 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:8 citee title:a cartesian parallel nested dissection algorithm citee abstract:this paper is concerned with the distributed parallel computation of an ordering for a symmetric positive definite sparse matrix. the purpose of the ordering is to limit fill and enhance concurrency in the subsequent computation of the cholesky factorization of the matrix. we use a geometric approach to nested dissection based on a given cartesian embedding of the graph of the matrix in euclidean space. the resulting algorithm can be implemented efficiently on massively parallel, distributed-memory... surrounding text:another class of graph partitioning techniques uses the geometric information of the graph to find a good partition. geometric partitioning algorithms [***, 48, 37, 36, 38]<2> tend to be fast but often yield partitions that are worse than those obtained by spectral methods. among the most prominent of these schemes is the algorithm described in [37, 36]<2>. in the absence of extensive data, we could not have done any better anyway. in table 9 we show three different variations of spectral partitioning [45, 47, 26, 2]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [***]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [37, 36, 15]<2>.
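the text at the start of this record notes that a k-way partition is normally obtained through a sequence of 2-way partitions. a minimal python sketch of that recursion follows; the bisect callable is a hypothetical placeholder for any of the 2-way schemes discussed in this section (spectral, multilevel, geometric, and so on), and is assumed to return two roughly equal-weight vertex sets.

    def recursive_bisection(adj, vertices, k, bisect):
        # obtain a k-way partition by recursive 2-way partitioning
        if k == 1:
            return {v: 0 for v in vertices}
        left, right = bisect(adj, vertices)
        k_left = k // 2
        part = recursive_bisection(adj, left, k_left, bisect)
        for v, p in recursive_bisection(adj, right, k - k_left, bisect).items():
            part[v] = p + k_left                  # shift the part numbers of the right half
        return part

for nested dissection orderings the same recursion is used, with the separator vertices of each bisection numbered after the vertices of the two halves.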
for each graph partitioning algorithm, table 9 shows a number of characteristics. schemes that rely on coordinate information do not seem to have this limitation, and in principle it appears that these schemes can be parallelized quite e ectively. however, all available parallel formulation of these schemes [***, 8]<2> obtained no better speedup than obtained for the multilevel scheme in [30]<2>. 9 influence:2 type:2 pair index:119 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the e ectiveness of many di erent choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan{lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:67 citee title:an improved spectral graph partitioning algorithm for mapping parallel computations citee abstract:efficient use of a distributed memory parallel computer requires that the computational load be balanced across processors in a way that minimizes interprocessor communication. a new domain mapping algorithm is presented that extends recent work in which ideas from spectral graph theory have been applied to this problem. the generalization of spectral graph bisection involves a novel use of multiple eigenvectors to allow for division of a computation into four or eight parts at each stage of a recursive decomposition. the resulting method is suitable for scientific computations like irregular finite elements or differences performed on hypercube or mesh architecture machines. experimental results confirm that the new method provides better decompositions arrived at more economically and robustly than with previous spectral methods. this algorithm allows for arbitrary nonnegative weights on both vertices and edges to model inhomogeneous computation and communication. a new spectral lower bound for graph bisection is also presented surrounding text:the graph partitioning problem is np-complete, however, many algorithms have been developed that find a reasonably good partition. spectral partitioning methods are known to produce good partitions for a wide class of problems, and they are used quite extensively [45, 47, ***]<2>. 
however, these methods are very expensive since they require the computation of the eigenvector corresponding to the second smallest eigenvalue (fiedler vector) influence:2 type:2 pair index:120 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:68 citee title:a multilevel algorithm for partitioning graphs citee abstract:the graph partitioning problem is that of dividing the vertices of a graph into sets of specified sizes such that few edges cross between sets. this np-complete problem arises in many important scientific and engineering problems. prominent examples include the mapping of parallel computations, the laying out of circuits and the ordering of sparse matrix computations. we present a multilevel algorithm for graph partitioning in which the graph is approximated by a sequence of increasingly surrounding text:another class of graph partitioning algorithms reduces the size of the graph (i.e., coarsens the graph) by collapsing vertices and edges, partitions the smaller graph, and then uncoarsens it to construct a partition for the original graph. these are called multilevel graph partitioning schemes [4, 7, 19, 20, ***, 10, 43]<2>. some researchers investigated multilevel schemes primarily to decrease the partitioning time, at the cost of somewhat worse partition quality [43]<2>. recently, a number of multilevel algorithms have been proposed [4, ***, 7, 20, 10]<2> that further refine the partition during the uncoarsening phase. these schemes tend to give good partitions at a reasonable cost. they partition the smallest graph and then uncoarsen the graph level by level, applying the kl algorithm to refine the partition. hendrickson and leland [***]<2> enhance this approach by using edge and vertex weights to capture the collapsing of vertices and edges.
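the fiedler-vector computation mentioned at the start of this record is the core of spectral bisection. the python sketch below uses a dense eigensolver, which is only reasonable for small graphs; practical spectral codes use lanczos iteration on the sparse laplacian, and the function name is an illustrative assumption.

    import numpy as np

    def spectral_bisection(adjacency_matrix):
        # bisect a connected graph by splitting the fiedler vector of its laplacian at the median
        A = np.asarray(adjacency_matrix, dtype=float)
        L = np.diag(A.sum(axis=1)) - A                 # graph laplacian L = D - A
        _, eigvecs = np.linalg.eigh(L)                 # eigenpairs in ascending eigenvalue order
        fiedler = eigvecs[:, 1]                        # eigenvector of the second smallest eigenvalue
        return (fiedler > np.median(fiedler)).astype(int)  # median split gives two roughly equal halves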
in particular, this latter work showed that multilevel schemes can provide better partitions than spectral methods at lower cost for a variety of finite element problems. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a new variation of the kl algorithm for refining the partition during the uncoarsening phase that is much faster than the kl refinement used in [***]<2>. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme consistently produces partitions that are better than those produced by spectral partitioning schemes in substantially smaller times (10 to 35 times faster than multilevel spectral bisection). 1 compared with the multilevel scheme of [***]<2>, our scheme is about two to seven times faster, and it is consistently better in terms of cut size. much of the improvement in runtime comes from our faster refinement heuristic, and the improvement in quality is due to the heavy-edge heuristic used during coarsening. the coarsening phase of these methods is relatively easy to parallelize [30]<2>, but the kl heuristic used in the refinement phase is very difficult to parallelize [16]<2>. since both the coarsening phase and the refinement phase with the kl heuristic take roughly the same amount of time, the overall runtime of the multilevel scheme of [***]<2> cannot be reduced significantly. our new faster methods for refinement reduce this bottleneck substantially. however, kl refinement further increases the runtime of the overall scheme as shown in figure 6, making the di erence in the runtime of msb-kl and our multilevel algorithm even greater. the graph partitioning package chaco implements its own multilevel graph partitioning algorithm that is modeled after the algorithm by hendrickson and leland [***, 25]<2>. this algorithm, which we refer to as chaco-ml, uses rm during coarsening, sb for partitioning the coarse graph, and kl refinement every other coarsening level during the uncoarsening phase. in the absence of extensive data, we could not have done any better anyway. in table 9 we show three di erent variations of spectral partitioning [45, 47, ***, 2]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [23]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [37, 36, 15]<2>. for each graph partitioning algorithm, table 9 shows a number of characteristics influence:1 type:2 pair index:121 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. 
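the multilevel procedure described above (partition the smallest graph, then project the partition back level by level while refining it) composes naturally with the coarsening, spectral, and fm sketches given earlier. the driver below is an illustrative composition under the conventions of those sketches, not the authors' implementation; the callables and the min_size threshold are assumptions.

    def multilevel_bisection(adj, vwgt, coarsen, initial_partition, refine, min_size=100):
        # generic multilevel bisection driver: coarsen until the graph is small, partition the
        # coarsest graph, then project the partition back level by level and refine it
        levels = []
        while len(adj) > min_size:
            cadj, cwgt, cmap = coarsen(adj, vwgt)          # e.g. random or heavy-edge matching
            if len(cadj) == len(adj):                      # coarsening stalled (no edges left)
                break
            levels.append((adj, cmap))
            adj, vwgt = cadj, cwgt
        part = initial_partition(adj)                      # any 2-way scheme returning a {vertex: 0 or 1} map
        for fine_adj, cmap in reversed(levels):
            part = {v: part[cmap[v]] for v in fine_adj}    # project the partition to the finer graph
            part, _ = refine(fine_adj, part)               # e.g. an fm-style boundary refinement pass
        return part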
we investigate the e ectiveness of many di erent choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan{lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:69 citee title:a parallel formulation of interior point algorithms citee abstract: in recent years, interior point algorithms have been used successfully for solving medium - to large - size linear programming (lp) problems in this paper we describe a highly parallel formulation of the interior point algorithm a key component of the interior point algorithm is the solution of a sparse system of linear equations using cholesky factorization the performance of parallel cholesky factorization is determined by (a) the communication overhead incurred by the algorithm, and (b) the load imbalance among the processors in our parallel interior point algorithm, we use our recently developed parallel multifrontal algorithm that has the smallest communication overhead over all parallel algorithms for cholesky factorization developed to date the computation imbalance depends on the shape of the elimination tree associated with the sparse system reordered for factorization to balance the computation, we implemented and evaluated four di erent ordering algorithms among these algorithms, kernighan - lin and spectral nested dissection yield the most balanced elimination trees and greatly increase the amount of parallelism that can be exploited our preliminary implementation achieves a speedup as high as 108 on 256 - processor ncube 2 on moderate - size problems surrounding text:elimination trees produced by mmd (a) exhibit little concurrency (long and slender) and (b) are unbalanced so that subtree-to-subcube mappings lead to significant load imbalances [32, 12, 18]<2>. on the other hand, orderings based on nested dissection produce orderings that have both more concurrency and better balance [***, 22]<2>. therefore, when the factorization is performed in parallel, the better utilization of the processors can cause the ratio of the runtime of parallel factorization algorithms ordered using mmd and that using mlnd to be substantially higher than the ratio of their respective operation counts influence:3 type:3 pair index:122 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph . 
from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:70 citee title:analysis of multilevel graph partitioning citee abstract:recently, a number of researchers have investigated a class of algorithms that are based on multilevel graph partitioning that have moderate computational complexity, and provide excellent graph partitions. however, there exists little theoretical analysis that could explain the ability of multilevel algorithms to produce good partitions. in this paper we present such an analysis. we show under certain reasonable assumptions that even if no refinement is used in the uncoarsening phase, surrounding text:if g0 is (maximal) planar, then gi is also (maximal) planar [34]<3>. this property is used to show that the multilevel algorithm produces partitions that are provably good for planar graphs [***]<1>. since maximal matchings are used to coarsen the graph, the number of vertices in gi+1 cannot be less than half the number of vertices in gi. hence, by selecting a maximal matching mi whose edges have a large weight, we can decrease the edge-weight of the coarser graph by a greater amount. as the analysis in [***]<1> shows, since the coarser graph has smaller edge-weight, it also has a smaller edge-cut. finding a maximal matching that contains edges with large weight is the idea behind the hem. this is because the multilevel graph partitioning captures global graph structure at two different levels. first, it captures global structure through the process of coarsening [***]<1>, and, second, it captures global structure during the initial graph partitioning by performing multiple trials. the sixth column of table 9 shows the relative time required by different graph partitioning schemes influence:1 type:2 pair index:123 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph.
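the heavy-edge idea explained in the surrounding text above (prefer to collapse the heaviest incident edge so that as much edge weight as possible disappears from the coarser graph) only changes how the matching is selected; the contraction step is the same as in the random_matching_coarsen sketch given earlier. an illustrative python sketch of the selection step:

    import random

    def heavy_edge_matching(adj):
        # maximal matching biased toward heavy edges: visit vertices in random order and
        # match each one with its heaviest unmatched neighbor
        match = {}
        order = list(adj)
        random.shuffle(order)
        for u in order:
            if u in match:
                continue
            candidates = [(w, v) for v, w in adj[u].items() if v not in match]
            if candidates:
                _, v = max(candidates)            # heaviest edge leading to an unmatched vertex
                match[u], match[v] = v, u
            else:
                match[u] = u                      # no unmatched neighbor: u stays single
        return match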
from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:71 citee title:a parallel algorithm for multilevel graph partitioning and sparse matrix ordering citee abstract:in this paper we present a parallel formulation of the multilevel graph partitioning and sparse matrix ordering algorithm. a key feature of our parallel formulation (that distinguishes it from other proposed parallel formulations of multilevel algorithms) is that it partitions the vertices of the graph into √p parts while distributing the overall adjacency matrix of the graph among all p processors. this mapping results in substantially smaller communication than one-dimensional distribution for surrounding text:even though multilevel algorithms are quite fast compared with spectral methods, they can still be the bottleneck if the sparse system of equations is being solved in parallel [32, 18]<2>. the coarsening phase of these methods is relatively easy to parallelize [***]<2>, but the kl heuristic used in the refinement phase is very difficult to parallelize [16]<2>. since both the coarsening phase and the refinement phase with the kl heuristic take roughly the same amount of time, the overall runtime of the multilevel scheme of [26]<2> cannot be reduced significantly. the runtime of snd is substantially higher than that of mlnd. also, snd cannot be parallelized any better than mlnd [***, 1]<2>. therefore, it will always be slower than mlnd. in contrast, a single trial of kl is very difficult to parallelize [16]<2> and appears inherently serial in nature. multilevel schemes that do not rely upon kl [***]<2> and the spectral bisection scheme are moderately parallel in nature. as discussed in [***]<2>, the asymptotic speedup for these schemes is bounded by o(√p). multilevel schemes that do not rely upon kl [***]<2> and the spectral bisection scheme are moderately parallel in nature. as discussed in [***]<2>, the asymptotic speedup for these schemes is bounded by o(√p). o(p) speedup can be obtained in these schemes only if the graph is nearly well partitioned among processors. schemes that rely on coordinate information do not seem to have this limitation, and in principle it appears that these schemes can be parallelized quite effectively.
however, all available parallel formulations of these schemes [23, 8]<2> obtained no better speedup than that obtained for the multilevel scheme in [***]<2>. the reason is that this combination requires very little time for refinement, which is the most serial part of the algorithm. the coarsening phase is relatively much easier to parallelize [***]<2>. influence:1 type:2 pair index:124 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:72 citee title:introduction to parallel computing: design and analysis of algorithms citee abstract:introduction to parallel computing provides an in-depth look at techniques for the design and analysis of parallel algorithms and for programming these algorithms on commercially available parallel platforms. the book discusses principles of parallel algorithm design and different parallel programming models with extensive coverage of mpi, posix threads, and openmp. it provides a broad and balanced coverage of various core topics such as sorting, graph algorithms, discrete optimization techniques, data-mining algorithms, and a number of algorithms used in numerical and scientific computing applications. the basic approach advocated in this text is one of portable parallel algorithm and software development, an emphasis lacking in all existing textbooks on parallel computing. to enhance the pedagogical value of the text, extensive examples, diagrams, exercises of varying degrees of difficulty, and bibliographical remarks are provided. in addition to serving as a textbook and a reference source for professionals and parallel software developers, the book will help students and researchers in non-computer-science disciplines who need to solve computation-intensive problems using parallel computers. surrounding text:a key step in each iteration of these methods is the multiplication of a sparse matrix and a (dense) vector.
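the excerpt above ends, and the next one begins, with the observation that the quality of the row partition governs how much communication a parallel sparse matrix-vector multiplication incurs. as a rough illustration only, the sketch below counts the off-part vector entries each part would need to receive; the scipy csr input and the row-to-part array are illustrative assumptions, not anything from the cited papers.

import numpy as np
import scipy.sparse as sp

def spmv_comm_volume(A_csr, part):
    """for each part, count the distinct x-entries owned by other parts that it
    needs in order to multiply its rows of A (a simple proxy for communication).
    A_csr: scipy.sparse.csr_matrix; part: array mapping row/column index -> part id."""
    part = np.asarray(part)
    needed = [set() for _ in range(int(part.max()) + 1)]
    indptr, indices = A_csr.indptr, A_csr.indices
    for i in range(A_csr.shape[0]):
        p = part[i]
        for j in indices[indptr[i]:indptr[i + 1]]:
            if part[j] != p:
                needed[p].add(int(j))      # x[j] lives on another part
    return [len(s) for s in needed]

if __name__ == "__main__":
    # tiny 4x4 example: a path graph's adjacency plus the diagonal
    A = sp.csr_matrix(np.array([[2, 1, 0, 0],
                                [1, 2, 1, 0],
                                [0, 1, 2, 1],
                                [0, 0, 1, 2]]))
    print(spmv_comm_volume(A, [0, 0, 1, 1]))   # -> [1, 1]: one boundary value each way

a smaller edge-cut in the graph of the matrix directly translates into fewer such boundary entries, which is why partition quality matters for this kernel.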
a good partition of the graph corresponding to matrix a can significantly reduce the amount of communication in parallel sparse matrix-vector multiplication [***]<1>. if parallel direct methods are used to solve a sparse system of equations, then a graph partitioning algorithm can be used to compute a fill-reducing ordering that leads to a high degree of concurrency in the factorization phase [***, 12]<1>. a good partition of the graph corresponding to matrix a can significantly reduce the amount of communication in parallel sparse matrix-vector multiplication [***]<1>. if parallel direct methods are used to solve a sparse system of equations, then a graph partitioning algorithm can be used to compute a fill-reducing ordering that leads to a high degree of concurrency in the factorization phase [***, 12]<1>. the multiple minimum degree ordering used almost exclusively in serial direct methods is not suitable for parallel direct methods, as it provides very little concurrency in the parallel factorization phase influence:2 type:2 pair index:125 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:73 citee title:a separator theorem for planar graphs citee abstract:let g be any n-vertex planar graph. we prove that the vertices of g can be partitioned into three sets a, b, c such that no edge joins a vertex in a with a vertex in b, neither a nor b contains more than 2n/3 vertices, and c contains no more than 2√2·√n vertices. we exhibit an algorithm which finds such a partition a, b, c in o(n) time. surrounding text:coarsening a graph using matchings preserves many properties of the original graph. if g0 is (maximal) planar, then gi is also (maximal) planar [***]<3>.
this property is used to show that the multilevel algorithm produces partitions that are provably good for planar graphs [28]<1> influence:2 type:2 pair index:126 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:74 citee title:modification of the minimum degree algorithm by multiple elimination citee abstract:the most widely used ordering scheme to reduce fills and operations in sparse matrix computation is the minimum-degree algorithm. the notion of multiple elimination is introduced here as a modification to the conventional scheme. the motivation is discussed using the k-by-k grid model problem. experimental results indicate that the modified version retains the fill-reducing property of (and is often better than) the original ordering algorithm and yet requires less computer time. the reduction in ordering time is problem dependent, and for some problems the modified algorithm can run a few times faster than existing implementations of the minimum-degree algorithm. the use of external degree in the algorithm is also introduced. surrounding text:we also used our graph partitioning scheme to compute fill-reducing orderings for sparse matrices. surprisingly, our scheme substantially outperforms the multiple minimum degree algorithm [***]<2>, which is the most commonly used method for computing fill-reducing orderings of a sparse matrix. even though multilevel algorithms are quite fast compared with spectral methods, they can still be the bottleneck if the sparse system of equations is being solved in parallel [32, 18]<2>. on a parallel computer, a fill-reducing ordering, besides minimizing the operation count, should also increase the degree of concurrency that can be exploited during factorization. in general, nested dissection-based orderings exhibit more concurrency during factorization than minimum degree orderings [13, ***]<2>.
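the records around this point repeatedly contrast nested dissection with the minimum degree heuristic. as a rough illustration of the latter, here is a minimal python sketch of symbolic elimination with a minimum-degree pivot rule; the graph format and the fill counter are illustrative, and production codes such as multiple minimum degree rely on far more sophisticated data structures and tie-breaking.

def minimum_degree_order(adj):
    """symbolic elimination with a minimum-degree pivot rule.
    adj: dict vertex -> set of neighbors (undirected). returns (ordering, fill_edges),
    where fill_edges counts edges added while turning each pivot's remaining
    neighborhood into a clique (a proxy for the fill created by factorization)."""
    adj = {u: set(nbrs) for u, nbrs in adj.items()}   # work on a copy
    order, fill = [], 0
    while adj:
        u = min(adj, key=lambda v: len(adj[v]))       # pivot of minimum current degree
        nbrs = list(adj.pop(u))
        order.append(u)
        for i, a in enumerate(nbrs):                  # connect the pivot's neighbors pairwise
            adj[a].discard(u)
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
    return order, fill

if __name__ == "__main__":
    # star graph: the rule eliminates the leaves first, so no fill is created,
    # whereas eliminating the center first would have added 3 fill edges
    star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
    print(minimum_degree_order(star))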
the minimum degree [13]<2> ordering heuristic is the most widely used fill-reducing algorithm for ordering sparse matrices for factorization on serial computers. the minimum degree algorithm has been found to produce very good orderings. the multiple minimum degree algorithm [***]<2> is the most widely used variant of minimum degree due to its very fast runtime. the quality of the orderings produced by our mlnd algorithm compared with that of mmd is shown in table 8 and figure 7 influence:1 type:2 pair index:127 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:75 citee title:automatic mesh partitioning citee abstract:this paper describes an efficient approach to partitioning unstructured meshes that occur naturally in the finite element and finite difference methods. the approach makes use of the underlying geometric structure of a given mesh and finds a provably good partition in random o(n) time. it applies to meshes in both two and three dimensions. the new method has applications in efficient sequential and parallel algorithms for large-scale problems in scientific computing. this is an overview paper written with emphasis on the algorithmic aspects of the approach. many detailed proofs can be found in companion papers. surrounding text:another class of graph partitioning techniques uses the geometric information of the graph to find a good partition. geometric partitioning algorithms [23, 48, 37, ***, 38]<2> tend to be fast but often yield partitions that are worse than those obtained by spectral methods. among the most prominent of these schemes is the algorithm described in [37, ***]<2>. geometric partitioning algorithms [23, 48, 37, ***, 38]<2> tend to be fast but often yield partitions that are worse than those obtained by spectral methods. among the most prominent of these schemes is the algorithm described in [37, ***]<2>.
this algorithm produces partitions that are provably within the bounds that exist for some special classes of graphs (which include graphs arising in finite element applications). in the absence of extensive data, we could not have done any better anyway. in table 9 we show three different variations of spectral partitioning [45, 47, 26, 2]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [23]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [37, ***, 15]<2>. for each graph partitioning algorithm, table 9 shows a number of characteristics influence:1 type:2 pair index:128 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:76 citee title:a unified geometric approach to graph separators citee abstract:a class of graphs called k-overlap graphs is proposed. special cases of k-overlap graphs include planar graphs, k-nearest neighbor graphs, and earlier classes of graphs associated with finite element methods. a separator bound is proved for k-overlap graphs embedded in d dimensions. the result unifies several earlier separator results. all the arguments are based on geometric properties of embedding. the separator bounds come with randomized linear-time and randomized nc algorithms. moreover, the bounds are the best possible up to the leading term surrounding text:another class of graph partitioning techniques uses the geometric information of the graph to find a good partition. geometric partitioning algorithms [23, 48, ***, 36, 38]<2> tend to be fast but often yield partitions that are worse than those obtained by spectral methods.
among the most prominent of these schemes is the algorithm described in [***, 36]<2>. this algorithm produces partitions that are provably within the bounds that exist for some special classes of graphs (which include graphs arising in finite element applications). in the absence of extensive data, we could not have done any better anyway. in table 9 we show three different variations of spectral partitioning [45, 47, 26, 2]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [23]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [***, 36, 15]<2>. for each graph partitioning algorithm, table 9 shows a number of characteristics influence:1 type:2 pair index:129 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:77 citee title:on the validity of a front-oriented approach to partitioning large sparse graphs with a connectivity constraint citee abstract:in this paper we consider the problem of partitioning large sparse graphs, such as finite element meshes. the heuristic which is proposed allows partitioning into connected and quasi-balanced subgraphs in a reasonable amount of time, while attempting to minimize the number of edge cuts. here the goal is to build partitions for graphs containing large numbers of nodes and edges, in practice at least 10^4. basically, the algorithm relies on the iterative construction of connected subgraphs. this construction is achieved by successively exploring clusters of nodes called fronts. indeed, a judicious use of fronts ensures the connectivity of the subsets at low cost: it is shown that locally, i.e. for a given subgraph, the complexity of such operations grows at most linearly with the number of edges. moreover, a few examples are given to illustrate the quality and speed of the heuristic.
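both the front-oriented heuristic just summarized and the graph growing partitioning (ggp) approach in the next excerpt grow one region breadth-first until it holds roughly half of the total vertex weight. the sketch below is an illustrative python rendering of that idea only; the graph format, the seed argument, and the unit vertex weights are assumptions for the example.

from collections import deque

def graph_growing_bisection(adj, seed, weight=None):
    """grow one part breadth-first from `seed` until it contains about half of
    the total vertex weight; every vertex not absorbed by then forms the other part.
    adj: dict vertex -> iterable of neighbors; weight: optional dict of vertex weights."""
    weight = weight or {v: 1 for v in adj}
    target = sum(weight.values()) / 2.0
    part, grown = set(), 0.0
    queue, seen = deque([seed]), {seed}
    while queue and grown < target:
        u = queue.popleft()
        part.add(u)
        grown += weight[u]
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return part, set(adj) - part

def edge_cut(adj, part):
    """number of edges with exactly one endpoint inside `part`."""
    return sum(1 for u in part for v in adj[u] if v not in part)

if __name__ == "__main__":
    # 6-vertex path: growing from an end vertex gives the natural half/half split
    path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
    a, b = graph_growing_bisection(path, seed=0)
    print(sorted(a), sorted(b), edge_cut(path, a))   # [0, 1, 2] [3, 4, 5] 1

as the next excerpt notes, the resulting edge-cut depends strongly on the seed vertex, which is why such schemes are usually run from several starting vertices and the best cut is kept.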
surrounding text:graph growing partitioning algorithm (ggp). another simple way of bisecting the graph is to start from a vertex and grow a region around it in a breadth-first fashion, until half of the vertices have been included (or half of the total vertex weight) [12, 17, ***]<2>. the quality of the ggp is sensitive to the choice of a vertex from which to start growing the graph, and different starting vertices yield different edge-cuts influence:1 type:2 pair index:130 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:78 citee title:combinatorial optimization citee abstract:this comprehensive textbook on combinatorial optimization puts special emphasis on theoretical results and algorithms with provably good performance, in contrast to heuristics. it has arisen as the basis of several courses on combinatorial optimization and more special topics at graduate level. since the complete book contains enough material for at least four semesters (4 hours a week), one usually selects material in a suitable way. the book contains complete but concise proofs, also for many deep results, some of which did not appear in a book before. many very recent topics are covered as well, and many references are provided. thus this book represents the state of the art of combinatorial optimization. this third edition contains a new chapter on facility location problems, an area which has been extremely active in the past few years. furthermore there are several new sections and further material on various topics. new exercises and updates in the bibliography were added. from the reviews of the 2nd edition: "this book on combinatorial optimization is a beautiful example of the ideal textbook."
operations research letters 33 (2005), p.216-217 "the second edition (with corrections and many updates) of this very recommendable book documents the relevant knowledge on combinatorial optimization and records those problems and algorithms that define this discipline today. to read this is very stimulating for all the researchers, practitioners, and students interested in combinatorial optimization." surrounding text:a maximal matching that has the maximum number of edges is called a maximum matching. however, because the complexity of computing a maximum matching [***]<1> is in general higher than that of computing a maximal matching, the latter is preferred. coarsening a graph using matchings preserves many properties of the original graph influence:3 type:2 pair index:131 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:79 citee title:computing the block triangular form of a sparse matrix citee abstract:we consider the problem of permuting the rows and columns of a rectangular or square, unsymmetric sparse matrix to compute its block triangular form. this block triangular form is based on a canonical decomposition of bipartite graphs induced by a maximum matching and was discovered by dulmage and mendelsohn. we describe implementations of algorithms to compute the block triangular form and provide computational results on sparse matrices from test collections. several applications of the block triangular form are also included. surrounding text:both a and b are ordered by recursively applying nested dissection ordering. in our multilevel nested dissection (mlnd) algorithm a vertex separator is computed from an edge separator by finding the minimum vertex cover [41, ***]<3>.
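the step just described, turning an edge separator into a vertex separator by taking a minimum vertex cover of the cut edges, can be illustrated directly: the cut edges form a bipartite graph between the two sides of the bisection, so a minimum cover can be read off a maximum matching via könig's theorem. the sketch below is an illustrative python rendering of that construction for small inputs, not the code used in the cited work.

def min_vertex_cover_of_cut(cut_edges):
    """cut_edges: list of (a, b) pairs, a from side A and b from side B of a bisection.
    returns a minimum vertex cover of these edges, i.e. a valid vertex separator."""
    adj = {}
    for a, b in cut_edges:
        adj.setdefault(a, []).append(b)
    match_to_a = {}                       # b -> matched partner on side A

    def augment(a, seen):                 # classic augmenting-path matching
        for b in adj.get(a, []):
            if b in seen:
                continue
            seen.add(b)
            if b not in match_to_a or augment(match_to_a[b], seen):
                match_to_a[b] = a
                return True
        return False

    for a in adj:
        augment(a, set())

    # koenig's construction: alternate from the unmatched A-vertices
    matched_a = set(match_to_a.values())
    visited_a = {a for a in adj if a not in matched_a}
    visited_b, frontier = set(), list(visited_a)
    while frontier:
        a = frontier.pop()
        for b in adj.get(a, []):
            if b not in visited_b:
                visited_b.add(b)
                a2 = match_to_a.get(b)    # follow the matched edge back to side A
                if a2 is not None and a2 not in visited_a:
                    visited_a.add(a2)
                    frontier.append(a2)
    return {a for a in adj if a not in visited_a} | visited_b

if __name__ == "__main__":
    # three cut edges sharing vertex 'a1': two cover vertices suffice
    print(min_vertex_cover_of_cut([("a1", "b1"), ("a1", "b2"), ("a2", "b2")]))

removing the returned vertices disconnects the two sides, and by könig's theorem the cover has the same size as the maximum matching, so no smaller separator can be obtained from these cut edges.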
the minimum vertex cover has been found to produce very small vertex separators influence:3 type:2 pair index:132 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:80 citee title:partitioning sparse matrices with eigenvectors of graphs citee abstract:the problem of computing a small vertex separator in a graph arises in the context of computing a good ordering for the parallel factorization of sparse, symmetric matrices. an algebraic approach to computing vertex separators is considered in this paper. it is shown that lower bounds on separator sizes can be obtained in terms of the eigenvalues of the laplacian matrix associated with a graph. the laplacian eigenvectors of grid graphs can be computed from kronecker products involving the eigenvectors of path graphs, and these eigenvectors can be used to compute good separators in grid graphs. a heuristic algorithm is designed to compute a vertex separator in a general graph by first computing an edge separator in the graph from an eigenvector of the laplacian matrix, and then using a maximum matching in a subgraph to compute the vertex separator. results on the quality of the separators computed by the spectral algorithm are presented, and these are compared with separators obtained from automatic nested dissection and the kernighan-lin algorithm. finally, we report the time required to compute the laplacian eigenvector, and consider the accuracy with which the eigenvector must be computed to obtain good separators. the spectral algorithm has the advantage that it can be implemented on a medium-size multiprocessor in a straightforward manner. surrounding text:the graph partitioning problem is np-complete; however, many algorithms have been developed that find a reasonably good partition. spectral partitioning methods are known to produce good partitions for a wide class of problems, and they are used quite extensively [***, 47, 24]<2>.
however, these methods are very expensive since they require the computation of the eigenvector corresponding to the second smallest eigenvalue (fiedler vector). in the absence of extensive data, we could not have done any better anyway. in table 9 we show three different variations of spectral partitioning [***, 47, 26, 2]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [23]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [37, 36, 15]<2>. for each graph partitioning algorithm, table 9 shows a number of characteristics influence:1 type:2 pair index:133 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:81 citee title:spectral nested dissection citee abstract:we describe a spectral nested dissection algorithm for computing orderings appropriate for parallel factorization of sparse, symmetric matrices. the algorithm makes use of spectral properties of the laplacian matrix associated with the given matrix to compute separators. we evaluate the quality of the spectral orderings with respect to several measures: fill, elimination tree height, height and weight balances of elimination trees, and clique tree heights. spectral orderings compare quite surrounding text:in [30]<1> we present a parallel formulation of our mlnd algorithm that achieves a speedup of as much as 57 on 128-processor cray t3d (over the serial algorithm running on a single t3d processor) for some graphs. spectral nested dissection (snd) [***]<1> can be used for ordering matrices for parallel factorization. the snd algorithm is based on the spectral graph partitioning algorithm described in section 4.1. we have implemented the snd algorithm described in [***]<1>.
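several of these excerpts refer to spectral bisection, which splits the graph by the fiedler vector of its laplacian. the following is a minimal numpy sketch of that idea; the dense eigendecomposition and the adjacency-dictionary input are illustrative simplifications, whereas practical spectral partitioners compute the fiedler vector with iterative eigensolvers such as lanczos.

import numpy as np

def fiedler_bisection(adj):
    """split the vertex set at the median of the fiedler vector, i.e. the eigenvector
    of the graph laplacian belonging to the second smallest eigenvalue.
    adj: dict vertex -> iterable of neighbors (undirected, unweighted, no self-loops)."""
    nodes = sorted(adj)
    idx = {v: i for i, v in enumerate(nodes)}
    L = np.zeros((len(nodes), len(nodes)))
    for u in nodes:
        for v in adj[u]:
            L[idx[u], idx[v]] -= 1.0
        L[idx[u], idx[u]] = len(adj[u])
    vals, vecs = np.linalg.eigh(L)          # dense solve: only sensible for small examples
    fiedler = vecs[:, 1]                    # column for the second smallest eigenvalue
    median = np.median(fiedler)
    part_a = {v for v in nodes if fiedler[idx[v]] <= median}
    return part_a, set(nodes) - part_a

if __name__ == "__main__":
    # two triangles joined by a single edge: the fiedler split separates the triangles
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    print(fiedler_bisection(g))

the expense that the surrounding text points to is exactly this eigenvector computation, which is what makes the multilevel and geometric alternatives attractive.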
as in the case of mlnd, the minimum vertex cover algorithm was used to compute a vertex separator from the edge separator influence:1 type:1 pair index:134 citer id:52 citer title:a fast and high quality multilevel scheme for partitioning irregular graphs citer abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm citee id:82 citee title:towards a fast implementation of spectral nested dissection citee abstract:the authors describe the spectral nested dissection (snd) algorithm, a novel algorithm for computing orderings appropriate for parallel factorization of sparse, symmetric matrices. the algorithm makes use of spectral properties of the laplacian matrix associated with the given matrix to compute separators. the authors evaluate the quality of the spectral orderings with respect to several measures: fill, elimination tree height, height and weight balances of elimination trees, and clique tree heights. they use some very large structural analysis problems as test cases and demonstrate on these real applications that spectral orderings compare quite favorably with commonly used orderings, outperforming them by a wide margin for some of these measures. the only disadvantage of snd is its relatively long execution time surrounding text:the graph partitioning problem is np-complete; however, many algorithms have been developed that find a reasonably good partition. spectral partitioning methods are known to produce good partitions for a wide class of problems, and they are used quite extensively [45, ***, 24]<2>. however, these methods are very expensive since they require the computation of the eigenvector corresponding to the second smallest eigenvalue (fiedler vector). in the absence of extensive data, we could not have done any better anyway.
in table 9 we show three different variations of spectral partitioning [45, ***, 26, 2]<2>, the multilevel partitioning described in this paper, the levelized nested dissection [11]<2>, the kl partition [31]<2>, the coordinate nested dissection (cnd) [23]<2>, two variations of the inertial partition [38, 25]<2>, and two variants of geometric partitioning [37, 36, 15]<2>. for each graph partitioning algorithm, table 9 shows a number of characteristics influence:1 type:2 pair index:135 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:95 citee title:architectural considerations for mobile mesh networking citee abstract:mobile communications is central to many military operations and is necessary to communicate simultaneously with multiple warfighters engaged in a common task. this paper describes the problem of routing and resource reservation in mobile mesh networks and presents architectural recommendations necessary for internet protocols to operate effectively in these environments. a "mobile mesh" network is an autonomous system of mobile routers connected by wireless links, the union of which forms an arbitrary graph. the routers are free to move randomly; thus, the network's wireless topology may change rapidly and unpredictably surrounding text:1.0 introduction. we consider the problem of routing in a mobile wireless network as described in [***]<3>. such a network can be envisioned as a collection of routers (equipped with wireless receiver/transmitters) which are free to move about arbitrarily influence:2 type:3 pair index:136 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm.
this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:49 citee title:a failsafe distributed routing protocol citee abstract:an algorithm for constructing and adaptively maintaining routing tables in communication networks is presented. the algorithm can be employed in message as well as circuit switching networks, uses distributed computation, provides routing tables that are loop-free for each destination at all times, adapts to changes in network flows, and is completely failsafe. the latter means that after arbitrary failures and additions, the network recovers in finite time in the sense of providing routing paths between all physically connected nodes. for each destination, the routes are independently updated by an update cycle triggered by the destination. surrounding text:congested links are also an expected characteristic of such a network as wireless links inherently have significantly lower capacity than hardwired links and are therefore more prone to congestion. existing shortest-path algorithms [2]<2> and adaptive shortest-path algorithms [***,4,5,6,7,8,9]<2> are not particularly well-suited for operation in such a network. these algorithms are designed for operation in static or quasi-static networks with hardwired links influence:2 type:2 pair index:137 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:96 citee title:a responsive distributed routing algorithm for computer networks citee abstract:a new distributed algorithm is presented for dynamically determining weighted shortest paths used for message routing in computer networks. the major features of the algorithm are that the paths defined do not form transient loops when weights change and the number of steps required to find new shortest paths when network links fail is less than for previous algorithms. specifically, the worst case recovery time is proportional to the largest number of hops h in any of the weighted shortest paths.
for previous loop-free distributed algorithms this recovery time is proportional to h^2. surrounding text:congested links are also an expected characteristic of such a network as wireless links inherently have significantly lower capacity than hardwired links and are therefore more prone to congestion. existing shortest-path algorithms [2]<2> and adaptive shortest-path algorithms [3,***,5,6,7,8,9]<2> are not particularly well-suited for operation in such a network. these algorithms are designed for operation in static or quasi-static networks with hardwired links influence:2 type:2 pair index:138 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:97 citee title:another adaptive shortest-path algorithm citee abstract:the authors give a distributed algorithm to compute shortest paths in a network with changing topology. the authors analyze its behavior. the proof of correctness is discussed. it does not suffer from the routing table looping behavior associated with the ford-bellman distributed shortest path algorithm although it uses truly distributed processing. its time and message complexities are evaluated. comparisons with other methods are given surrounding text:congested links are also an expected characteristic of such a network as wireless links inherently have significantly lower capacity than hardwired links and are therefore more prone to congestion. existing shortest-path algorithms [2]<2> and adaptive shortest-path algorithms [3,4,***,6,7,8,9]<2> are not particularly well-suited for operation in such a network. these algorithms are designed for operation in static or quasi-static networks with hardwired links influence:2 type:2 pair index:139 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm.
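the cited works in these records are adaptive shortest-path (distance-vector) algorithms, which tora deliberately departs from. purely for orientation, here is a minimal synchronous distributed bellman-ford sketch of the distance-vector idea in python; the data structures are illustrative and none of the cited protocols' loop-avoidance mechanisms are modeled.

def distance_vector_converge(links, dest, max_rounds=100):
    """synchronous distributed bellman-ford toward a single destination.
    links: dict node -> dict neighbor -> link cost (symmetric costs assumed).
    returns dict node -> (estimated distance to dest, next hop)."""
    INF = float("inf")
    dist = {u: (0.0, None) if u == dest else (INF, None) for u in links}
    for _ in range(max_rounds):
        new, changed = {}, False
        for u in links:                  # each node recomputes from neighbors' vectors
            if u == dest:
                new[u] = (0.0, None)
                continue
            best = min(((links[u][v] + dist[v][0], v) for v in links[u]),
                       default=(INF, None))
            new[u] = best
            changed |= (best != dist[u])
        dist = new
        if not changed:
            break
    return dist

if __name__ == "__main__":
    net = {"a": {"b": 1, "c": 4}, "b": {"a": 1, "c": 1}, "c": {"a": 4, "b": 1}}
    print(distance_vector_converge(net, dest="c"))   # a routes to c via b at cost 2

the well-known weakness of this family under frequent topology changes (slow convergence and transient loops) is what motivates the loop-free variants cited here and, ultimately, the link-reversal approach of the citing paper.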
this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:98 citee title:distributed routing with labeled distances citee abstract:the author presents, verifies, and analyzes a new routing algorithm called the labeled distance-vector routing algorithm (ldr), that is loop-free at every instant, eliminates the counting-to-infinity problem of the distributed bellman-ford (dbf) algorithm, operates with arbitrary link and node delays, and provides shortest paths a finite time after the occurrence of an arbitrary sequence of topological changes. in contrast to previous successful approaches to loop-free routing, ldr maintains dbf's row-independence property and does not require internodal coordination spanning multiple loops. the new algorithm is shown to be loop-free and to converge in a finite time after an arbitrary sequence of topological changes. its performance is compared with the performance of other distributed routing algorithms surrounding text:congested links are also an expected characteristic of such a network as wireless links inherently have significantly lower capacity than hardwired links and are therefore more prone to congestion. existing shortest-path algorithms [2]<2> and adaptive shortest-path algorithms [3,4,5,***,7,8,9]<2> are not particularly well-suited for operation in such a network. these algorithms are designed for operation in static or quasi-static networks with hardwired links influence:2 type:2 pair index:140 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:99 citee title:loop-free routing using diffusing computations citee abstract:a family of distributed algorithms for the dynamic computation of the shortest paths in a computer network or internet is presented, validated, and analyzed. according to these algorithms, each node maintains a vector with its distance to every other node.
update messages from a node are sent only to its neighbors; each such message contains a distance vector of one or more entries, and each entry specifies the length of the selected path to a network destination, as well as an indication of whether the entry constitutes an update, a query, or a reply to a previous query. the new algorithms treat the problem of distributed shortest-path routing as one of diffusing computations, which was first proposed by dijkstra and scholten. they improve on algorithms introduced previously by chandy and misra, jaffe and moss, merlin and segall, and the author. the new algorithms are shown to converge in finite time after an arbitrary sequence of link cost or topological changes, to be loop-free at every instant, and to outperform all other loop-free routing algorithms previously proposed from the standpoint surrounding text:congested links are also an expected characteristic of such a network as wireless links inherently have significantly lower capacity than hardwired links and are therefore more prone to congestion. existing shortest-path algorithms [2]<2> and adaptive shortest-path algorithms [3,4,5,6,***,8,9]<2> are not particularly well-suited for operation in such a network. these algorithms are designed for operation in static or quasi-static networks with hardwired links influence:2 type:2 pair index:141 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:100 citee title:a loop-free path-finding algorithm: specification, verification, and complexity citee abstract:the loop-free path-finding algorithm (lpa) is presented. lpa specifies the second-to-last hop and distance to each destination to ensure termination; in addition, it uses an inter-neighbor synchronization mechanism to eliminate temporary loops. a detailed proof of lpa's correctness is presented and its complexity is evaluated. lpa's average performance is compared by simulation with the performance of algorithms representative of the state of the art in distributed routing, namely an ideal link-state (ils) algorithm and a loop-free algorithm that is based on internodal coordination spanning multiple hops (dual). the simulation results show that lpa is a more scalable alternative than dual and ils in terms of the average number of steps, messages, and operations needed for each algorithm to converge after a topology change.
lpa is shown to achieve loop freedom at every instant without much additional overhead over that incurred by prior algorithms based on second-to-last hop and distance information. surrounding text:congested links are also an expected characteristic of such a network as wireless links inherently have significantly lower capacity than hardwired links and are therefore more prone to congestion. existing shortest-path algorithms [2]<2> and adaptive shortest-path algorithms [3,4,5,6,7,***,9]<2> are not particularly well-suited for operation in such a network. these algorithms are designed for operation in static or quasi-static networks with hardwired links influence:2 type:2 pair index:142 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:101 citee title:distributed, scalable routing based on vectors of link states citee abstract:link vector algorithms (lva) are introduced for the distributed maintenance of routing information in large networks and internets. according to an lva, each router maintains a subset of the topology that corresponds to adjacent links and those links used by its neighbor routers in their preferred paths to known destinations. based on that subset of topology information, the router derives its own preferred paths and communicates the corresponding link-state information to its neighbors. an surrounding text:congested links are also an expected characteristic of such a network as wireless links inherently have significantly lower capacity than hardwired links and are therefore more prone to congestion. existing shortest-path algorithms [2]<2> and adaptive shortest-path algorithms [3,4,5,6,7,8,***]<2> are not particularly well-suited for operation in such a network. these algorithms are designed for operation in static or quasi-static networks with hardwired links influence:2 type:2 pair index:143 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals.
the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:102 citee title:distributed algorithms for generating loop-free routes in networks with frequently changing topology citee abstract:we consider the problem of maintaining communication between the nodes of a data network and a central station in the presence of frequent topological changes as, for example, in mobile packet radio networks. we argue that flooding schemes have significant drawbacks for such networks, and propose a general class of distributed algorithms for establishing new loop-free routes to the station for any node left without a route due to changes in the network topology. by virtue of built-in redundancy, the algorithms are typically activated very infrequently and, even when they are, they do not involve any communication within the portion of the network that has not been materially affected by a topological change. surrounding text:while link-state algorithms provide the capability for multipath routing, the time and communication overhead associated with maintaining full topological knowledge at each router makes them impractical for this environment as well. some existing algorithms which have been developed for this environment include the following: the gafni-bertsekas (gb) algorithms [***]<2>, the lightweight mobile routing (lmr) protocol [11]<2>, the destination-sequenced distance vector (dsdv) routing protocol [12]<2>, the wireless routing protocol (wrp) [13]<2>, and the dynamic source routing (dsr) protocol [14]<2>. while these algorithms are better suited for this environment, each has its drawbacks influence:2 type:2 pair index:144 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes.
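the gafni-bertsekas abstract above describes route maintenance by link reversal. a minimal sketch of the "full reversal" idea under simple assumptions (heights modeled as (counter, node_id) pairs, synchronous passes); this illustrates the general technique, not the cited algorithm's exact specification.

```python
# Sketch of the "full reversal" idea behind the Gafni-Bertsekas (GB) family
# cited above: a non-destination node with no outgoing (downhill) link raises
# its height above all neighbors, reversing every incident link. Heights are
# modeled as simple (counter, node_id) pairs purely for illustration.
def full_reversal_step(heights, neighbors, dest):
    """One synchronous pass; returns True if any node reversed its links."""
    changed = False
    for v in heights:
        if v == dest:
            continue
        nbrs = neighbors[v]
        if nbrs and all(heights[v] < heights[u] for u in nbrs):
            # no neighbor is lower -> no outgoing link -> reverse all links
            top = max(heights[u] for u in nbrs)
            heights[v] = (top[0] + 1, v)
            changed = True
    return changed

if __name__ == "__main__":
    # line topology a - b - dest; a's former link toward dest is assumed lost
    neighbors = {"a": ["b"], "b": ["a", "dest"], "dest": ["b"]}
    heights = {"a": (0, "a"), "b": (0, "b"), "dest": (-1, "dest")}
    while full_reversal_step(heights, neighbors, "dest"):
        pass
    print(heights)  # every non-destination node ends with a downhill path
```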
we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:34 citee title:a distributed routing algorithm for mobile wireless networks citee abstract:we present a loop-free, distributed routing protocol for mobile packet radio networks. the protocol is intended for use in networks where the rate of topological change is not so fast as to make "flooding" the only possible routing method, but not so slow as to make one of the existing protocols for a nearly-static topology applicable. the routing algorithm adapts asynchronously in a distributed fashion to arbitrary changes in topology in the absence of global topological knowledge. the protocol's uniqueness stems from its ability to maintain source-initiated, loop-free multipath routing only to desired destinations with minimal overhead in a randomly varying topology. the protocol's performance, measured in terms of end-to-end packet delay and throughput, is compared with that of pure flooding and an alternative algorithm which is well-suited to the high-rate topological change environment envisioned here. for each protocol, emphasis is placed on examining how these performance measures vary as a function of the rate of topological changes, network topology, and message traffic level. the results indicate the new protocol generally outperforms the alternative protocol at all rates of change for heavy traffic conditions, whereas the opposite is true for light traffic. both protocols significantly outperform flooding for all rates of change except at ultra-high rates where all algorithms collapse. the network topology, whether dense or sparsely connected, is not seen to be a major factor in the relative performance of the algorithms. surrounding text:while link-state algorithms provide the capability for multipath routing, the time and communication overhead associated with maintaining full topological knowledge at each router makes them impractical for this environment as well. some existing algorithms which have been developed for this environment include the following: the gafni-bertsekas (gb) algorithms [10]<2>, the lightweight mobile routing (lmr) protocol [***]<2>, the destination-sequenced distance vector (dsdv) routing protocol [12]<2>, the wireless routing protocol (wrp) [13]<2>, and the dynamic source routing (dsr) protocol [14]<2>. while these algorithms are better suited for this environment, each has its drawbacks influence:2 type:2 pair index:145 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity.
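the repeated citer abstract alludes to a clock-tagged "height" that imposes a temporal order on reactions to link failures. a speculative sketch of that kind of lexicographic height comparison, assuming the commonly described tora-style quintuple layout; the field names and ordering are assumptions, not quoted from the paper.

```python
from dataclasses import dataclass

# Illustrative model of a TORA-style "height": a reference level stamped with
# the time of a link failure (tau, oid, r) plus an ordering offset (delta,
# node_id). Field names and ordering follow common descriptions of TORA and
# are assumptions here, not text from the cited paper.
@dataclass(order=True, frozen=True)
class Height:
    tau: float      # time the reference level was created (physical/logical clock)
    oid: str        # id of the node that defined the reference level
    r: int          # reflection bit (0 = original, 1 = reflected)
    delta: int      # ordering parameter within the reference level
    node_id: str    # unique node id, breaks all remaining ties

def directed_from(u_height, v_height):
    """Links are directed from the higher node toward the lower node."""
    return u_height > v_height

h_u = Height(tau=0.0, oid="-", r=0, delta=2, node_id="u")
h_v = Height(tau=0.0, oid="-", r=0, delta=1, node_id="v")
print(directed_from(h_u, h_v))  # True: u is upstream of v
```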
this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:103 citee title:highly dynamic destination-sequenced distance vector routing (dsdv) for mobile computers citee abstract:an ad-hoc network is the cooperative engagement of a collection of mobile hosts without the required intervention of any centralized access point. in this paper we present an innovative design for the operation of such ad-hoc networks. the basic idea of the design is to operate each mobile host as a specialized router, which periodically advertises its view of the interconnection topology with other mobile hosts within the network. this amounts to a new sort of routing protocol. we have surrounding text:while link-state algorithms provide the capability for multipath routing, the time and communication overhead associated with maintaining full topological knowledge at each router makes them impractical for this environment as well. some existing algorithms which have been developed for this environment include the following: the gafni-bertsekas (gb) algorithms [10]<2>, the lightweight mobile routing (lmr) protocol [11]<2>, the destination-sequenced distance vector (dsdv) routing protocol [***]<2>, the wireless routing protocol (wrp) [13]<2>, and the dynamic source routing (dsr) protocol [14]<2>. while these algorithms are better suited for this environment, each has its drawbacks. a possible enhancement to the protocol would be to periodically propagate refresh packets outwards from the destination, reception of which resets the reference level of all nodes to zero and restores distance significance to their di's. the usage of periodic, destination-initiated, route optimization was mentioned as a possible routing enhancement in [18]<2> and, later, a similar technique was developed as the major mechanism for route adaptation and maintenance in [***]<2>. besides serving as a routing enhancement, the periodic refresh guarantees that router state errors (resulting from undetectable errors in packet transmissions or other sources) do not persist for arbitrary lengths of time influence:2 type:2 pair index:146 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes.
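the dsdv abstract above relies on destination-generated sequence numbers to keep advertised routes fresh and loop-free. a minimal sketch of the usual dsdv route-preference rule, assuming a simple record layout; this is an illustration of the rule, not the cited protocol's full machinery.

```python
# Sketch of the DSDV route-selection rule referenced above: prefer the
# advertisement with the newer (higher, destination-generated) sequence
# number; break ties with the smaller hop-count metric. Record layout is
# an illustrative assumption.
def better_route(current, advertised):
    """Each route is a dict: {'seq': int, 'metric': int, 'next_hop': str}."""
    if current is None:
        return advertised
    if advertised["seq"] > current["seq"]:
        return advertised
    if advertised["seq"] == current["seq"] and advertised["metric"] < current["metric"]:
        return advertised
    return current

route = None
for adv in ({"seq": 100, "metric": 3, "next_hop": "b"},
            {"seq": 102, "metric": 5, "next_hop": "c"},   # newer info wins
            {"seq": 102, "metric": 4, "next_hop": "d"}):  # same seq, shorter path
    route = better_route(route, adv)
print(route)  # {'seq': 102, 'metric': 4, 'next_hop': 'd'}
```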
we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:104 citee title:an efficient routing protocol for wireless networks citee abstract:we present the wireless routing protocol (wrp). in wrp, routing nodes communicate the distance and second-to-last hop for each destination. wrp reduces the number of cases in which a temporary routing loop can occur, which accounts for its fast convergence properties. a detailed proof of correctness is presented and its performance is compared by simulation with the performance of the distributed bellman-ford algorithm (dbf), dual (a loop-free distance-vector algorithm) and an ideal link-state algorithm (ils), which represent the state of the art of internet routing. the simulation results indicate that wrp is the most efficient of the alternatives analyzed. surrounding text:while link-state algorithms provide the capability for multipath routing, the time and communication overhead associated with maintaining full topological knowledge at each router makes them impractical for this environment as well. some existing algorithms which have been developed for this environment include the following: the gafni-bertsekas (gb) algorithms [10]<2>, the lightweight mobile routing (lmr) protocol [11]<2>, the destination-sequenced distance vector (dsdv) routing protocol [12]<2>, the wireless routing protocol (wrp) [***]<2>, and the dynamic source routing (dsr) protocol [14]<2>. while these algorithms are better suited for this environment, each has its drawbacks influence:1 type:2 pair index:147 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:105 citee title:dynamic source routing in ad hoc wireless networks citee abstract:an ad hoc network is a collection of wireless mobile hosts forming a temporary network without the aid of any established infrastructure or centralized administration. in such an environment, it may be necessary for one mobile host to enlist the aid of other hosts in forwarding a packet to its destination, due to the limited range of each mobile host's wireless transmissions. this paper presents a protocol for routing in ad hoc networks that uses dynamic source routing. the protocol adapts quickly to routing changes when host movement is frequent, yet requires little or no overhead during periods in which hosts move less frequently.
based on results from a packet-level simulation of mobile hosts operating in an ad hoc network, the protocol performs well over a variety of environmental conditions such as host density and movement rates. for all but the highest rates of host movement simulated, the overhead of the protocol is quite low, falling to just 1% of total data packets transmitted for moderate movement rates in a network of 24 mobile hosts. in all cases, the difference in length between the routes used and the optimal route lengths is negligible, and in most cases, route lengths are on average within a factor of 1.01 of optimal. surrounding text:while link-state algorithms provide the capability for multipath routing, the time and communication overhead associated with maintaining full topological knowledge at each router makes them impractical for this environment as well. some existing algorithms which have been developed for this environment include the following: the gafni-bertsekas (gb) algorithms [10]<2>, the lightweight mobile routing (lmr) protocol [11]<2>, the destination-sequenced distance vector (dsdv) routing protocol [12]<2>, the wireless routing protocol (wrp) [13]<2>, and the dynamic source routing (dsr) protocol [***]<2>. while these algorithms are better suited for this environment, each has its drawbacks influence:1 type:2 pair index:148 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:106 citee title:network time protocol citee abstract:the network time protocol (ntp) is a protocol for synchronizing the clocks of computer systems over packet-switched, variable-latency data networks. ntp uses udp port 123 as its transport layer. it is designed particularly to resist the effects of variable latency (jitter buffer). surrounding text:for now we will assume that all nodes have synchronized clocks. this could be accomplished via interface with an external time source such as the global positioning system (gps) [15]<1> or through use of an algorithm such as the network time protocol [***]<1>. as we will discuss in section 2 influence:2 type:3 pair index:149 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms.
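the surrounding text mentions synchronizing node clocks via gps or ntp. a minimal sketch of the standard ntp offset and round-trip delay estimate from one request/response exchange; the timestamps below are made up for illustration.

```python
# Standard NTP-style offset/delay estimate from one request/response exchange.
def ntp_estimate(t1, t2, t3, t4):
    """t1: client send, t2: server receive, t3: server send, t4: client receive."""
    offset = ((t2 - t1) + (t3 - t4)) / 2.0   # estimated clock offset vs. the server
    delay = (t4 - t1) - (t3 - t2)            # round-trip network delay
    return offset, delay

print(ntp_estimate(t1=10.000, t2=10.120, t3=10.121, t4=10.010))
# offset ~= 0.1155 s (server clock ahead), delay ~= 0.009 s
```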
the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:107 citee title:time, clocks, and the ordering of events in a distributed system citee abstract:the concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. the use of the total ordering is illustrated with a method for solving synchronization problems. the algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become. surrounding text:the results would be the same. an excellent analysis on the ordering of events in a distributed system is provided in [***]<1>. while the details will not be covered here, suffice it to say that simply establishing the order of events does not require the use of physical clocks influence:2 type:3 pair index:150 citer id:94 citer title:a highly adaptive distributed routing algorithm for mobile wireless networks citer abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) citee id:33 citee title:a distributed routing algorithm for mobile radio networks citee abstract:we present a distributed routing protocol for mobile packet radio networks. the protocol is intended for use in networks where the rate of topological change is not so fast as to make "flooding" the only possible routing method but not so slow as to make one of the existing protocols for a static topology applicable.
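the lamport paper cited above shows how to order events in a distributed system without physical clocks. a minimal sketch of the logical-clock rules it defines: tick on every local event, stamp outgoing messages, and take max(local, received) + 1 on receipt.

```python
# Minimal Lamport logical clock: establishes a total order consistent with
# the "happened before" relation, with no physical clock involved.
class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time            # timestamp carried by the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_msg = a.send()                    # a's clock: 1
b.local_event()                     # b's clock: 1
print(b.receive(t_msg))             # 2 -> the receive is ordered after the send
```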
the routing algorithm adapts asynchronously in a distributed fashion to arbitrary changes in topology in the absence of global topological knowledge. the protocol maintains a set of loop-free routes to each destination from any node that desires a route. the protocol's performance, measured in terms of end-to-end packet delay and throughput, is compared with both that of pure flooding and an alternative algorithm which is well-suited to the medium-rate topological change environment envisioned here. the results show that when the rate of topological changes becomes very high flooding is preferable to the other alternatives. for lower rates of change, it appears that when the effects of channel access are accounted for, the performance of the new algorithm is encouraging in that it has been generally superior to that of the alternative protocols surrounding text:a possible enhancement to the protocol would be to periodically propagate refresh packets outwards from the destination, reception of which resets the reference level of all nodes to zero and restores distance significance to their di's. the usage of periodic, destination-initiated, route optimization was mentioned as a possible routing enhancement in [***]<2> and, later, a similar technique was developed as the major mechanism for route adaptation and maintenance in [12]<2>. besides serving as a routing enhancement, the periodic refresh guarantees that router state errors (resulting from undetectable errors in packet transmissions or other sources) do not persist for arbitrary lengths of time influence:1 type:2 pair index:151 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:109 citee title:mining comparative sentences and relations citee abstract:this paper studies a text mining problem, comparative sentence mining (csm).
a comparative sentence expresses an ordering relation between two sets of entities with respect to some common features. for example, the comparative sentence "canon's optics are better than those of sony and nikon" expresses the comparative relation: (better, , , ). given a set of evaluative texts on the web, e.g., reviews, forum postings, and news articles, the task of comparative sentence mining is (1) to identify comparative sentences from the texts and (2) to extract comparative relations from the identified comparative sentences. this problem has many applications. for example, a product manufacturer wants to know customer opinions of its products in comparison with those of its competitors. in this paper, we propose two novel techniques based on two new types of sequential rules to perform the tasks. experimental evaluation has been conducted using different types of evaluative texts from the web. results show that our techniques are very promising. surrounding text:it is thus highly desirable to produce a summary of reviews [13, 21]<2> (see below and also section 3). in the past few years, many researchers studied the problem, which is called opinion mining or sentiment analysis [***, 3, 13, 15, 28, 37]<2>. the main tasks are (1) to find product features that have been commented on by reviewers and (2) to decide whether the comments are positive or negative. dictionary-based approaches use synonyms and antonyms in wordnet to determine word sentiments based on a set of seed opinion words. such approaches are studied in [***, 8, 13, 17]<2>. [13]<2> proposes the idea of opinion mining and summarization influence:3 type:2 pair index:152 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective.
it outperforms existing methods significantly citee id:110 citee title:automatic construction of polarity-tagged corpus from html documents citee abstract:this paper proposes a novel method of building polarity-tagged corpus from html documents. the characteristics of this method are that it is fully automatic and can be applied to arbitrary html documents. the idea behind our method is to utilize certain layout structures and linguistic pattern. by using them, we can automatically extract such sentences that express opinion. in our experiment, the method could construct a corpus consisting of 126,610 sentences. surrounding text:it is thus highly desirable to produce a summary of reviews [13, 21]<2> (see below and also section 3). in the past few years, many researchers studied the problem, which is called opinion mining or sentiment analysis [1, 3, 13, ***, 28, 37]<2>. the main tasks are (1) to find product features that have been commented on by reviewers and (2) to decide whether the comments are positive or negative. the opinions on "voice quality" and "reception" are positive, and the opinion on "battery life" is negative. other related works at both the document and sentence levels include those in [2, 10, ***, 16, 36]<2>. most sentence level and even document level classification methods are based on identification of opinion words or phrases. however, the system is domain specific. other recent work related to sentiment analysis includes [3, ***, 16, 18, 19, 20, 21, 22, 24, 30, 34]<2>. [14]<2> studies the extraction of comparative sentences and relations, which is different from this work as we do not deal with comparative sentences in this research influence:3 type:2 pair index:153 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective.
it outperforms existing methods significantly citee id:111 citee title:determining the sentiment of opinions citee abstract:identifying sentiments (the affective parts of opinions) is a challenging problem. we present a system that, given a topic, automatically finds the people who hold opinions about that topic and the sentiment of each opinion. the system contains a module for determining word sentiment and another for combining sentiments within a sentence. we experiment with various models of classifying and combining sentiment at word and sentence levels, with promising results. surrounding text:dictionary-based approaches use synonyms and antonyms in wordnet to determine word sentiments based on a set of seed opinion words. such approaches are studied in [1, 8, 13, ***]<2>. [13]<2> proposes the idea of opinion mining and summarization. it uses a lexicon-based method to determine whether the opinion expressed on a product feature is positive or negative. a related method is used in [***]<2>. these methods are improved in [28]<2> by a more sophisticated method based on relaxation labeling influence:2 type:2 pair index:154 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:112 citee title:automatic identification of pro and con reasons in online reviews citee abstract:in this paper, we present a system that automatically extracts the pros and cons from online reviews. although many approaches have been developed for extracting opinions from text, our focus here is on extracting the reasons of the opinions, which may themselves be in the form of either fact or opinion. leveraging online review sites with author-generated pros and cons, we propose a system for aligning the pros and cons to their sentences in review texts. 
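the surrounding text above describes the dictionary-based strategy: grow an opinion lexicon from a small seed set using wordnet synonyms (same orientation) and antonyms (flipped orientation). a minimal sketch with nltk's wordnet interface; the tiny seed lists and iteration count are assumptions, not the cited papers' procedures.

```python
# Dictionary-based opinion-lexicon expansion via WordNet synonyms/antonyms.
# requires: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def expand_lexicon(seeds, iterations=2):
    """seeds: {word: +1 or -1}; returns an enlarged {word: orientation} map."""
    lexicon = dict(seeds)
    for _ in range(iterations):
        new = {}
        for word, orient in lexicon.items():
            for syn in wn.synsets(word):
                for lemma in syn.lemmas():
                    new.setdefault(lemma.name().lower(), orient)        # synonym keeps polarity
                    for ant in lemma.antonyms():
                        new.setdefault(ant.name().lower(), -orient)     # antonym flips polarity
        for w, o in new.items():
            lexicon.setdefault(w, o)   # keep the first orientation assigned
    return lexicon

lex = expand_lexicon({"good": +1, "poor": -1})
print(len(lex), lex["good"], lex["poor"])
```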
a maximum entropy model is then trained on the resulting labeled set to subsequently extract pros and cons from online review sites that do not explicitly provide them. our experimental results show that our resulting system identifies pros and cons with 66% precision and 76% recall surrounding text:however, the system is domain specific. other recent work related to sentiment analysis includes [3, 15, 16, ***, 19, 20, 21, 22, 24, 30, 34]<2>. [14]<2> studies the extraction of comparative sentences and relations, which is different from this work as we do not deal with comparative sentences in this research influence:2 type:2 pair index:155 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:113 citee title:opinion extraction, summarization and tracking in news and blog corpora citee abstract:humans like to express their opinions and are eager to know others�� opinions. automatically mining and organizing opinions from heterogeneous information sources are very useful for individuals, organizations and even governments. opinion extraction, opinion summarization and opinion tracking are three important techniques for understanding opinions. opinion extraction mines opinions at word, sentence and document levels from articles. opinion summarization summarizes opinions of articles by telling sentiment polarities, degree and the correlated events. in this paper, both news and web blog articles are investigated. trec, ntcir and articles collected from web blogs serve as the information sources for opinion extraction. documents related to the issue of animal cloning are selected as the experimental materials. algorithms for opinion extraction at word, sentence and document level are proposed. the issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. opinion summarizations are visualized by representative sentences. 
text-based summaries in different languages, and from different sources, are compared. finally, an opinionated curve showing supportive and nonsupportive degree along the timeline is illustrated by an opinion tracking system. surrounding text:however, the system is domain specific. other recent work related to sentiment analysis includes [3, 15, 16, 18, 19, ***, 21, 22, 24, 30, 34]<2>. [14]<2> studies the extraction of comparative sentences and relations, which is different from this work as we do not deal with comparative sentences in this research influence:2 type:2 pair index:156 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. 
second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly. surrounding text:the large number of reviews also makes it hard for product manufacturers or businesses to keep track of customer opinions and sentiments on their products and services. it is thus highly desirable to produce a summary of reviews [13, ***]<2> (see below and also section 3). in the past few years, many researchers studied the problem, which is called opinion mining or sentiment analysis [1, 3, 13, 15, 28, 37]<2>. however, the system is domain specific. other recent work related to sentiment analysis includes [3, 15, 16, 18, 19, 20, ***, 22, 24, 30, 34]<2>. [14]<2> studies the extraction of comparative sentences and relations, which is different from this work as we do not deal with comparative sentences in this research. what is important is that this is a structured summary produced from unstructured text. the summary can also be easily visualized to give a clear view of opinions on different object features from existing users [***]<2>. the rest of the paper focuses on solving problem 3 influence:1 type:2 pair index:157 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:116 citee title:examining the role of linguistic knowledge sources in the automatic identification and classification of reviews citee abstract:this paper examines two problems in document-level sentiment analysis: (1) determining whether a given document is a review or not, and (2) classifying the polarity of a review as positive or negative. we first demonstrate that review identification can be performed with high accuracy using only unigrams as features.
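the surrounding text above describes the structured, feature-based summary that such systems visualize. a minimal sketch of that aggregation step, assuming upstream steps have already produced (feature, orientation) pairs per review sentence; the data and names are illustrative.

```python
# Feature-based opinion summary: count positive/negative opinions per feature.
from collections import defaultdict

def summarize(opinions):
    """opinions: iterable of (feature, orientation) with orientation in {+1, -1}."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for feature, orientation in opinions:
        key = "positive" if orientation > 0 else "negative"
        summary[feature][key] += 1
    return dict(summary)

print(summarize([("picture quality", +1), ("picture quality", +1),
                 ("battery life", -1), ("size", +1)]))
# {'picture quality': {'positive': 2, 'negative': 0}, 'battery life': {...}, ...}
```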
we then examine the role of four types of simple linguistic knowledge sources in a polarity classification system surrounding text:however, the system is domain specific. other recent work related to sentiment analysis includes [3, 15, 16, 18, 19, 20, 21, 22, ***, 30, 34]<2>. [14]<2> studies the extraction of comparative sentences and relations, which is different from this work as we do not deal with comparative sentences in this research influence:2 type:2 pair index:158 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:117 citee title:seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales citee abstract:we address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). this task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". we first evaluate human performance at the task. then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. we show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of svms when we employ a novel similarity measure appropriate to the problem. surrounding text:common pos categories in english are: noun, verb, adjective, adverb, pronoun, preposition, conjunction and interjection. in this project, we used the nlprocessor linguistic parser [***]<1> for pos tagging. 
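the surrounding text above notes that pos tags drive the identification of product features and opinion words; the paper uses the nlprocessor parser, which is not shown here. a stand-in sketch with nltk's tagger, treating nouns as candidate features and adjectives as candidate opinion words.

```python
# Stand-in POS tagging step (NLTK instead of NLProcessor): nouns/noun phrases
# are candidate product features, adjectives are candidate opinion words.
# requires: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

def candidates(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    features = [w for w, t in tagged if t.startswith("NN")]
    opinion_words = [w for w, t in tagged if t.startswith("JJ")]
    return features, opinion_words

print(candidates("The battery life is amazing but the screen is dim."))
# typically (['battery', 'life', 'screen'], ['amazing', 'dim'])
```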
idioms: apart from opinion words, there are also idioms. (product / no. of reviews / no. of features: 1 digital camera 1, 45, 263; 2 digital camera 2, 34, 191; 3 cellular phone 1, 49, 374; 4 mp3 player, 95, 720; 5 dvd player, 99, 356; 6 cellular phone 2, 41, 306; 7 router, 31, 227; 8 anti-virus software, 51, 179; total, 445, 2616.) the nlprocessor system [***]<1> is used to generate pos tags. after pos tagging, our system opinion observer is applied to find orientations of opinions expressed on product features influence:3 type:3 pair index:159 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:sentiment classification investigates ways to classify each review document as positive, negative, or neutral. representative works on classification at the document level include [4, 5, 9, 12, 26, ***, 29, 32]<2>.
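the pang et al. abstract above frames document-level sentiment classification as standard text categorization. a minimal scikit-learn sketch in that spirit; the toy training data below is an assumption, not the cited movie-review corpus.

```python
# Document-level sentiment classification as ordinary text classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great camera , excellent pictures", "terrible battery , poor screen",
        "amazing sound quality", "awful interface and bad support"]
labels = ["pos", "neg", "pos", "neg"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["amazing pictures and great sound"]))  # -> ['pos'] on this toy data
```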
these works are different from ours as we are interested in opinions expressed on each product feature rather than the whole review influence:2 type:2 pair index:160 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:119 citee title:extracting product features and opinions from reviews citee abstract:consumers are often forced to wade through many on-line reviews in order to make an informed product choice. this paper introduces opine, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products.compared to previous work, opine achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. opine's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity. surrounding text:it is thus highly desirable to produce a summary of reviews [13, 21]<2> (see below and also section 3). in the past few years, many researchers studied the problem, which is called opinion mining or sentiment analysis [1, 3, 13, 15, ***, 37]<2>. the main tasks are (1) to find product features that have been commented on by reviewers and (2) to decide whether the comments are positive or negative. asking a domain expert or user to provide such knowledge is not scalable due to the huge number of products, product features and opinion words. several researchers have attempted the problem [11, 16, ***]<2>. however, their approaches still have some major limitations as we will see in the next section. a related method is used in [17]<2>. these methods are improved in [***]<2> by a more sophisticated method based on relaxation labeling. 
we will show in section 5 that the proposed technique performs much better than both these methods. it is more flexible. [***]<2>. also uses similar rules to compute opinion orientations based on relaxation labeling. also uses similar rules to compute opinion orientations based on relaxation labeling. however, as we will see, [***]<2> produces poorer results than the proposed method. 3. in this table, we compare three techniques: (1) the proposed new technique used in opinion observer, (2) the proposed technique without handling context dependency of opinion words, (3) the existing technique fbs in [13]<2>. in table 3, we will also compare with the opine system in [***]<2>, which improved fbs. from table 2, we observe that the new algorithm opinion observer has a much higher f-score than the existing fbs method. thus, we conclude that both the 237 score function and the handling of context dependent opinion words are very useful. table 3 compares the results of the opine system reported in [***]<2> based on the same benchmark data set (reviews of the first 5 products in table 1). it was shown in [***]<2> that opine outperforms fbs. table 3 compares the results of the opine system reported in [***]<2> based on the same benchmark data set (reviews of the first 5 products in table 1). it was shown in [***]<2> that opine outperforms fbs. here, we are only able to compare the average results as individual results for each product are not reported in [***]<2>. it was shown in [***]<2> that opine outperforms fbs. here, we are only able to compare the average results as individual results for each product are not reported in [***]<2>. it can be observed that opinion observer outperforms opine on both precision and recall. it can be observed that opinion observer outperforms opine on both precision and recall. furthermore, the new algorithm is much simpler than the relaxation labeling method used in [***]<2>. in the table, we also include the results of the fbs method on the reviews of the first 5 products influence:1 type:2 pair index:161 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. 
a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:120 citee title:learning extraction patterns for subjective expressions citee abstract:this paper presents a bootstrapping process that learns linguistically rich extraction patterns for subjective (opinionated) expressions. high-precision classifiers label unannotated data to automatically create a large training set, which is then given to an extraction pattern learning algorithm. the learned patterns are then used to identify more subjective sentences. the bootstrapping process learns many subjective patterns and increases recall while maintaining high precision. surrounding text:sentiment classification investigates ways to classify each review document as positive, negative, or neutral. representative works on classification at the document level include [4, 5, 9, 12, 26, 27, ***, 32]<2>. these works are different from ours as we are interested in opinions expressed on each product feature rather than the whole review influence:3 type:2 pair index:162 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:121 citee title:toward opinion summarization: linking the sources citee abstract:we target the problem of linking source mentions that belong to the same entity (source coreference resolution), which is needed for creating opinion summaries. in this paper we describe how source coreference resolution can be transformed into standard noun phrase coreference resolution, apply a state-of-the-art coreference resolution approach to the transformed data, and evaluate on an available corpus of manually annotated opinions surrounding text:however, the system is domain specific. 
other recent work related to sentiment analysis includes [3, 15, 16, 18, 19, 20, 21, 22, 24, ***, 34]<2>. [14]<2> studies the extraction of comparative sentences and relations, which is different from this work as we do not deal with comparative sentences in this research influence:3 type:2 pair index:163 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:122 citee title:word sense and subjectivity citee abstract:subjectivity and meaning are both important properties of language. this paper explores their interaction, and brings empirical evidence in support of the hypotheses that (1) subjectivity is a property that can be associated with word senses, and (2) word sense disambiguation can directly benefit from subjectivity annotations. surrounding text:g. , the works in [10, 32, ***]<2>. dictionary-based approaches use synonyms and antonyms in wordnet to determine word sentiments based on a set of seed opinion words. however, the system is domain specific. other recent work related to sentiment analysis includes [3, 15, 16, 18, 19, 20, 21, 22, 24, 30, ***]<2>. [14]<2> studies the extraction of comparative sentences and relations, which is different from this work as we do not deal with comparative sentences in this research influence:2 type:2 pair index:164 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. 
most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:123 citee title:towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences citee abstract:opinion question answering is a challenging task for natural language processing. in this paper, we discuss a necessary component for an opinion question answering system: separating opinions from fact, at both the document and sentence level. we present a bayesian classifier for discriminating between documents with a preponderance of opinions such as editorials from regular news stories, and describe three unsupervised, statistical techniques for the significantly harder task of detecting opinions at the sentence level. we also present a first model for classifying opinion sentences as positive or negative in terms of the main perspective being expressed in the opinion. results from a large collection of news stories and a human evaluation of 400 sentences are reported, indicating that we achieve very high performance in document classification (upwards of 97% precision and recall), and respectable performance in detecting opinions and classifying them at the sentence level as positive, negative, or neutral (up to 91% accuracy). surrounding text:the opinion on ��voice quality��, ��reception�� are positive, and the opinion on ��battery life�� is negative. other related works at both the document and sentence levels include those in [2, 10, 15, 16, ***]<2>. most sentence level and even document level classification methods are based on identification of opinion words or phrases influence:2 type:2 pair index:165 citer id:108 citer title:a holistic lexicon-based approach to opinion mining citer abstract:one of the important types of information on the web is the opinions expressed in the user generated content, e.g., customer reviews of products, forum posts, and blogs. in this paper, we focus on customer reviews of products. in particular, we study the problem of determining the semantic orientations (positive, negative or neutral) of opinions expressed on product features in reviews. this problem has many applications, e.g., opinion mining, summarization and search. most existing techniques utilize a list of opinion (bearing) words (also called opinion lexicon) for the purpose. opinion words are words that express desirable (e.g., great, amazing, etc.) or undesirable (e.g., bad, poor, etc) states. 
these approaches, however, all have some major shortcomings. in this paper, we propose a holistic lexicon-based approach to solving the problem by exploiting external evidences and linguistic conventions of natural language expressions. this approach allows the system to handle opinion words that are context dependent, which cause major difficulties for existing algorithms. it also deals with many special words, phrases and language constructs which have impacts on opinions based on their linguistic patterns. it also has an effective function for aggregating multiple conflicting opinion words in a sentence. a system, called opinion observer, based on the proposed technique has been implemented. experimental results using a benchmark product review data set and some additional reviews show that the proposed technique is highly effective. it outperforms existing methods significantly citee id:124 citee title:movie review mining and summarization citee abstract:with the flourish of the web, online review is becoming a more and more useful and important information resource for people. as a result, automatic review mining and summarization has become a hot research topic recently. different from traditional text summarization, review mining and summarization aims at extracting the features on which the reviewers express their opinions and determining whether the opinions are positive or negative. in this paper, we focus on a specific domain - movie review. a multi-knowledge based approach is proposed, which integrates wordnet, statistical analysis and movie knowledge. the experimental results show the effectiveness of the proposed approach in movie review mining and summarization. surrounding text:it is thus highly desirable to produce a summary of reviews [13, 21]<2> (see below and also section 3). in the past few years, many researchers studied the problem, which is called opinion mining or sentiment analysis [1, 3, 13, 15, 28, ***]<2>. the main tasks are (1) to find product features that have been commented on by reviewers and (2) to decide whether the comments are positive or negative. we will show in section 5 that the proposed technique performs much better than both these methods. in [***]<2>, a system is reported for analyzing movie reviews in the same framework. however, the system is domain specific influence:1 type:2 pair index:166 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. 
then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms. our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:30 citee title:a data mining framework for optimal product selection in retail supermarket data: the generalized profset model citee abstract:in recent years, data mining researchers have developed efficient association rule algorithms for retail market basket analysis. still, retailers often complain about how to adopt association rules to optimize concrete retail marketing-mix decisions. it is in this context that, in a previous paper, the authors have introduced a product selection model called profset. this model selects the most interesting products from a product assortment based on their cross-selling potential given some surrounding text:another research direction that has been inspired by the microeconomic view of data mining is the extension of association rule mining to take into account the indirect profit of products that are frequently purchased together with some other products. brijs [***]<2> proposes profset to model the cross-selling effects by identifying "purchase intentions" in the transactions. lin et al [8]<2> introduce a value added model of association rule mining where the value could represent the profit, the privacy or other measures of the utility of a frequent itemset influence:2 type:2 pair index:167 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms. our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:130 citee title:a threshold of ln n for approximating set cover citee abstract:given a collection f of subsets of s = {1, . . .
, n}, set cover is the problem of selecting as few as possible subsets from f such that their union covers s, and max k-cover is the problem of selecting k subsets from f such that their union has maximum cardinality. both these problems are np-hard. we prove that (1 - o(1)) ln n is a threshold below which set cover cannot be approximated efficiently, unless np has slightly superpolynomial time algorithms. this closes the gap (up to low-order terms) between the ratio of approximation achievable by the greedy algorithm (which is (1 - o(1)) ln n), and previous results of lund and yannakakis, that showed hardness of approximation within a ratio of (log_2 n)/2 ≈ 0.72 ln n. for max k-cover, we show an approximation threshold of (1 - 1/e) (up to low-order terms), under the assumption that p ≠ np. surrounding text:mec is a well known np-complete problem and can be easily reduced from set cover [5]<3>. in [***]<3>, feige proved that the simple greedy algorithm, iteratively selecting the next product that covers the largest number of uncovered customers, approximates mec by a ratio of at least 1 - 1/e ≈ 0 influence:2 type:2 pair index:168 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms. our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:131 citee title:a microeconomic view of data mining citee abstract:we present a rigorous framework, based on optimization, for evaluating data mining operations such as associations and clustering, in terms of their utility in decision making. this framework leads quickly to some interesting computational problems related to sensitivity analysis, segmentation and the theory of games. surrounding text:
the microeconomic framework for data mining [***]<1> assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. introduction so far, only few theoretical frameworks for mining useful knowledge from data have been proposed in the literature. the microeconomic framework for data mining [***]<1> is considered as one of the most promising of these models [9]<1>. this framework considers an enterprise with a set of possible decisions and a set of customers that, depending on the decision chosen, contribute different amounts to the overall utility of a decision from the point of view of the enterprise. 2. related work the microeconomic approach to data mining has been introduced by kleinberg et al [***]<1> formalizing the optimization problem of enterprises based on data allowing the enterprise to predict the utility of a customer w. r influence:1 type:1 pair index:169 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms. our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:132 citee title:mining values added association rules citee abstract:value added product is an industrial term referring a minor addition to some major products. in this paper, we borrow the term to denote a minor semantic addition to the well known association rules. we consider the addition of numerical values to the attribute values, such as sale price, profit, degree of fuzziness, level of security and so on. such additions lead to the notion of random variables (as added value to attributes) in the data model and hence the probabilistic considerations of data mining.
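the greedy covering heuristic referred to in the feige citation context above (repeatedly pick the product whose customer set covers the most still-uncovered customers, which guarantees at least a 1 - 1/e fraction of the optimal coverage) can be sketched as follows; this is an illustrative python sketch with invented names, not code taken from any of the cited papers.

# Illustrative sketch: greedy max k-cover over products and the customers they reach.
def greedy_max_cover(customers_per_product, k):
    """customers_per_product: dict product -> set of customer ids; k: catalog size."""
    covered = set()
    chosen = []
    candidates = dict(customers_per_product)
    for _ in range(k):
        # pick the product adding the largest number of newly covered customers
        best = max(candidates, key=lambda p: len(candidates[p] - covered), default=None)
        if best is None or not (candidates[best] - covered):
            break  # nothing left to gain
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered

if __name__ == "__main__":
    data = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {5}, "p4": {1, 5, 6}}
    print(greedy_max_cover(data, 2))  # e.g. (['p1', 'p4'], {1, 2, 3, 5, 6})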
surrounding text:brijs [3]<2> proposes profset to model the cross-selling effects by identifying "purchase intentions" in the transactions. lin et al [***]<2> introduce a value added model of association rule mining where the value could represent the profit, the privacy or other measures of the utility of a frequent itemset. wang et al [14]<2> present a method for proposing a target item whenever a customer purchases a non-target item influence:3 type:2 pair index:170 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms. our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:133 citee title:theoretical framework for data mining citee abstract:research in data mining and knowledge discovery in databases has mostly concentrated on developing good algorithms for various data mining tasks (see for example the recent proceedings of kdd conferences). some parts of the research effort have gone to investigating data mining process, user interface issues, database topics, or visualization. relatively little has been published about the theoretical foundations of data mining. in this paper i present some possible theoretical approaches to data mining. the area is at its infancy, and there probably are more questions than answers in this paper. surrounding text:introduction so far, only few theoretical frameworks for mining useful knowledge from data have been proposed in the literature. the microeconomic framework for data mining [7]<1> is considered as one of the most promising of these models [***]<1>.
this framework considers an enterprise with a set of possible decisions and a set of customers that, depending on the decision chosen, contribute different amounts to the overall utilityof a decision from the point of view of the enterprise influence:2 type:3 pair index:171 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms. our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:134 citee title:efficient algorithms for creating product catalogs citee abstract:for the purposes of this paper we define a catalog to be a promotional catalog, i.e., a collection of products (items) presented to a customer with the hope of encouraging a purchase. the single mailing problem addresses how to build a collection of catalogs and distribute them to customers (one per customer) so as to achieve an optimal outcome, e.g., the most profit. each catalog is a subset of a given set of items and different catalogs may contain some of the same items, i.e., catalogs may overlap. a slightly more general, but important extension of the single mailing problem seeks the optimal set of catalogs when multiple mailings are allowed, i.e., multiple catalogs can be sent to each customer. catalog creation has important applications for e-commerce and traditional brick-and-mortar retailers, especially when used with personalized recommender systems. the catalog creation problem is np complete and some (relatively expensive) approximation algorithms have recently been developed. in this paper we describe more efficient techniques for building catalogs and show that these algorithms outperform one of the previously suggested approaches. (indeed, the techniques previously suggested are not feasible for realistic numbers of customers and catalogs.) some of our techniques directly use the objective function, e.g., maximize profit, to find a locally optimal solution in an approach based on gradient ascent. however, by combining such techniques with the clustering of similar customers, better results can sometimes be obtained. 
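a schematic rendering, in my own notation, of the utility-maximization view described above: the enterprise picks one decision y from a feasible set d, each customer c with data x_c contributes a utility, and catalog segmentation is the special case in which a decision is a collection of k catalogs of r products and every customer is served the best matching catalog. this is a hedged sketch, not a formula quoted from the cited papers.

% own notation: D feasible decisions, C customers, x_c data on customer c, u a utility function
\[
  y^{*} \;=\; \arg\max_{y \in D} \; \sum_{c \in C} u(x_c, y)
\]
% catalog segmentation instance: S_c is the set of products customer c would buy
\[
  \max_{\substack{A_1,\dots,A_k \subseteq P \\ |A_j| = r}} \;\;
  \sum_{c \in C} \; \max_{1 \le j \le k} \bigl| S_c \cap A_j \bigr|
\]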
we also analyze the performance of our algorithms with respect to a theoretical bound and show that, in some cases, their performance is close to optimal. surrounding text:the microeconomic framework for data mining has in particular been investigated for segmentation (clustering) problems where the enterprise does not make an optimal decision per individual customer but chooses one optimal decision per customer segment. catalog segmentation, a specialized segmentation problem, has received considerable attention [6, 7, ***]<2>: the enterprise wants to design k product catalogs of size r that maximize the overall customer purchases after having sent the best matching catalog to each customer. the catalog segmentation problem measures the utility of a customer in terms of catalog products purchased. in the data mining community, the catalog segmentation problem has been treated as a clustering problem. steinbach et al [***]<2> show that the sampling-based enumeration algorithm [6]<2> is infeasible for realistic problem sizes. instead, they propose two alternative heuristic algorithms and a hybrid algorithm (hcc) combining both of them. the real dataset records the purchasing transactions of the customers of a large canadian retailer over a period of several weeks. since the customer-oriented catalog segmentation problem has not yet been addressed in the literature, we compare our proposed algorithms with one of the state-of-the-art algorithms [***]<2> for the related catalog segmentation problem. we choose dcc as our comparison partner because of the following two reasons. we choose dcc as our comparison partner because of the following two reasons. first, the experimental evaluation in [***]<2> showed that dcc, together with hcc, achieved the highest quality results. second, dcc scales better to large customer databases than hcc because dcc can, different from hcc, use storage efficient adjacency lists instead of an adjacency matrix influence:1 type:2 pair index:172 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms.
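to make the contrast concrete, the following illustrative python sketch (own code, invented names) evaluates, for a fixed set of catalogs, the classical catalog segmentation utility (total catalog products of interest) and the customer-oriented utility (number of customers whose best catalog meets the minimum interest threshold t) described in the citer abstract above.

# Illustrative sketch: baskets maps each customer to the set of products of interest.
def classical_utility(baskets, catalogs):
    # classical objective: sum over customers of the best catalog's overlap
    return sum(max(len(b & cat) for cat in catalogs) for b in baskets.values())

def customer_oriented_utility(baskets, catalogs, t):
    # customer-oriented objective: count customers whose best catalog
    # contains at least t products they are interested in
    return sum(1 for b in baskets.values()
               if max(len(b & cat) for cat in catalogs) >= t)

if __name__ == "__main__":
    baskets = {"c1": {1, 2, 3}, "c2": {4}, "c3": {2, 4, 5}}
    catalogs = [{1, 2}, {4, 5}]
    print(classical_utility(baskets, catalogs))             # 2 + 1 + 2 = 5
    print(customer_oriented_utility(baskets, catalogs, 2))  # c1 and c3 -> 2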
our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:135 citee title:mpis: maximal-profit item selection with cross-selling considerations citee abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient surrounding text:wang et al [12]<2> apply the principle of mutual reinforcement of hub/authority web pages in order to rank items taking into account their indirect profits. addressing a similar problem, wong et al [***]<2> study the problem of selecting a maximum profit subset of items based on modeling the cross-selling effects with association rules. while all these approaches incorporate the notion of utility into the process of association rule mining, they analyze the relationships between sets of products without considering which customers have purchased these products influence:2 type:2 pair index:173 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms.
our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:136 citee title:item selection by "hub-authority" profit ranking citee abstract:a fundamental problem in business and other applications is ranking items with respect to some notion of profit based on historical transactions. the difficulty is that the profit of one item not only comes from its own sales, but also from its influence on the sales of other items, i.e., the "cross-selling effect". in this paper, we draw an analogy between this influence and the mutual reinforcement of hub/authority web pages. based on this analogy, we present a novel approach to the item surrounding text:this method maximizes the total profit of target items for future customers. wang et al [***]<2> apply the principle of mutual reinforcement of hub/authority web pages in order to rank items taking into account their indirect profits. addressing a similar problem, wong et al [11]<2> study the problem of selecting a maximum profit subset of items based on modeling the cross-selling effects with association rules influence:3 type:2 pair index:174 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms.
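the "mutual reinforcement of hub/authority web pages" mentioned in the surrounding text can be illustrated by a standard hits-style power iteration over an item-to-item co-purchase matrix; the sketch below is a generic illustration of that principle, not the ranking algorithm of the cited paper.

# Illustrative sketch: W[i][j] counts transactions containing both item i and item j.
def hub_authority_ranking(W, iterations=50):
    n = len(W)
    hub = [1.0] * n
    auth = [1.0] * n
    for _ in range(iterations):
        # authority score: weighted sum of hub scores pointing at the item
        auth = [sum(W[i][j] * hub[i] for i in range(n)) for j in range(n)]
        # hub score: weighted sum of authority scores the item points to
        hub = [sum(W[i][j] * auth[j] for j in range(n)) for i in range(n)]
        # normalize to keep the scores bounded
        na, nh = sum(auth) or 1.0, sum(hub) or 1.0
        auth = [a / na for a in auth]
        hub = [h / nh for h in hub]
    return auth  # use the authority scores as a cross-selling-aware item ranking

if __name__ == "__main__":
    W = [[0, 3, 1],
         [3, 0, 0],
         [1, 0, 0]]
    print(hub_authority_ranking(W))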
our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:137 citee title:approximate the 2-catalog segmentation problem using semidefinite programming relaxations citee abstract:we consider the $2$-catalog segmentation problem ($2$-csp) introduced by kleinberg, papadimitriou and raghavan \cite{kpr}. in this problem, we are given a ground set $i$ of $n$ items, a family $\{ s_1, s_2, \cdots, s_m \}$ of subsets of $i$ and an integer $1\le k\le n$. the objective is to find subsets $a_1, a_2 \subset i$ such that $|a_1| = |a_2| = k$ and $\sum_{i=1}^{m} \max \{|s_i \cap a_1|, |s_i \cap a_2|\}$ is maximized. it is known that a simple greedy algorithm has a performance guarantee surrounding text:asodi and safra [2]<2> proved that a polynomial time (1/2 + ε)-approximation algorithm, for any constant ε > 0, would imply np = p. xu et al [***]<2> developed an approximation algorithm based on semi-definite programming that has a performance guarantee of 1/2 for general r and of strictly greater than 1/2 for r ≥ n/3. in particular, 2-catalog segmentation can be approximated by a factor of 0. |·| denotes the cardinality. the original catalog segmentation problem can be defined in a more illustrative graph-based manner (partition version) as follows [***]<2>: given a bipartite graph g = (p, c, e) with |p| = m and |c| = n, find a partition of c = c1 ∪ c2 ∪ . influence:2 type:2 pair index:175 citer id:129 citer title:a microeconomic data mining problem: customer-oriented catalog segmentation citer abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms.
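the truncated graph-based (partition version) formulation above can be completed, in own notation, as the following objective: partition the customer side c of the bipartite graph g = (p, c, e) into k groups and give each group its best catalog of r products, maximizing the total number of covered edges. this is a hedged reconstruction of the standard formulation, not a verbatim quote of the citer's statement.

% own notation: C_1, ..., C_k a disjoint partition of the customers, A_j the catalog for group j
\[
  \max_{C = C_1 \,\dot\cup\, \cdots \,\dot\cup\, C_k} \;\;
  \sum_{j=1}^{k} \; \max_{\substack{A_j \subseteq P \\ |A_j| = r}} \;
  \bigl| \{ (p, c) \in E : p \in A_j,\; c \in C_j \} \bigr|
\]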
our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms citee id:138 citee title:profit mining: from patterns to actions citee abstract:a major obstacle in data mining applications is the gap between the statistic-based pattern extraction and the value-based decision making. we present a profit mining approach to reduce this gap. in profit mining, we are given a set of past transactions and pre-selected target items, and we like to build a model for recommending target items and promotion strategies to new customers, with the goal of maximizing the net profit. we identify several issues in profit mining and propose surrounding text:lin et al [8]<2> introduce a value added model of association rule mining where the value could represent the profit, the privacy or other measures of the utility of a frequent itemset. wang et al [***]<2> present a method for proposing a target item whenever a customer purchases a non-target item. this method maximizes the total profit of target items for future customers influence:3 type:2 pair index:176 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:733 citee title:mining association rules between sets of items in large databases citee abstract:we are given a large database of customer transactions. each transaction consists of items purchased by a customer in a visit. we present an efficient algorithm that generates all significant association rules between items in the database. the algorithm incorporates buffer management and novel estimation and pruning techniques. we also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm. surrounding text:this growth of data requires sophisticated methods in the analysis. at about the same time, association rule mining [***]<3> has been proposed by computer scientists, which aims at understanding the relationships among items in transactions or market baskets. however, it is generally true that the association rules in themselves do not serve the end purpose of the business people influence:2 type:2 pair index:177 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed.
however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:550 citee title:fast algorithms for mining association rules citee abstract:we consider the problem of discovering association rules between items in a large database of sales transactions. we present two new algorithms for solving this problem that are fundamentally different from the known algorithms. empirical evaluation shows that these algorithms outperform the known algorithms by factors ranging from three for small problems to more than an order of magnitude for large problems. we also show how the best features of the two proposed algorithms can be combined into a hybrid algorithm, called apriorihybrid. scale-up experiments show that apriorihybrid scales linearly with the number of transactions. apriorihybrid also has excellent scale-up properties with respect to the transaction size and the number of items in the database surrounding text:the problem is to find all rules with sufficient support and confidence. some of the earlier work includes [22, ***, 21]<2>. influence:2 type:2 pair index:178 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach.
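for reference, the standard textbook definitions behind "rules with sufficient support and confidence" in the surrounding text are the following (background, not quoted from the cited papers):

% for an itemset X and a rule X => Y over a transaction database D
\[
  \mathrm{supp}(X) \;=\; \frac{|\{\, T \in D : X \subseteq T \,\}|}{|D|},
  \qquad
  \mathrm{conf}(X \Rightarrow Y) \;=\; \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
\]
% mining then means enumerating all rules whose support and confidence exceed user-given thresholds.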
experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:595 citee title:heuristic algorithms for the unconstrained binary quadratic programming problem citee abstract:in this paper we consider the unconstrained binary quadratic programming problem. this is the problem of maximising a quadratic objective by suitable choice of binary (zero-one) variables. we present two heuristic algorithms based upon tabu search and simulated annealing for this problem. computational results are presented for a number of publically available data sets involving up to 2500 variables. an interesting feature of our results is that whilst for most problems tabu search dominates surrounding text:p4.4. any 0-1 quadratic programming problem is polynomially reducible to an unconstrained binary quadratic programming problem [16]<3>. an unconstrained binary quadratic programming problem can be transformed to a binary linear programming problem (zero-one linear programming) [***]<3>. more related properties can be found in [20]<3> and [14]<3> influence:3 type:3 pair index:179 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:531 citee title:every transaction tells a story citee abstract:financial institutions, retailers and payment processors running point-of-sale (pos) and automated teller machine (atm) terminals are looking for ways to lower their operational costs, without increasing the risk of service disruption to their end customers. many are already utilizing dial-up terminals instead of costly full-service terminals, and are looking to implement a payment solution that will: surrounding text:1 introduction recent studies in the retailing market have shown a winning edge for customer-oriented business, which is based on decision making from better knowledge about the customer behaviour. furthermore, the behaviour in terms of sales transactions is considered significant [***]<3>. this is also called market basket analysis. here we investigate the application of association rule mining on the problem of market basket analysis. as pointed out in [***]<3>, a major task of talented merchants is to pick the profit generating items and discard the losing items.
it may be simple enough to sort items by their profit and do the selection influence:2 type:3 pair index:180 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:30 citee title:a data mining framework for optimal product selection in retail supermarket data: the generalized profset model citee abstract:in recent years, data mining researchers have developed efficient association rule algorithms for retail market basket analysis. still, retailers often complain about how to adopt association rules to optimize concrete retail marketing-mix decisions. it is in this context that, in a previous paper, the authors have introduced a product selection model called profset. this model selects the most interesting products from a product assortment based on their cross-selling potential given some surrounding text:1 item selection related work there are some recent works on the maximal-profit item selection problem. profset [8, ***]<2> models the cross-selling effects by frequent itemsets, which are sets of items co-occurring frequently. a maximal frequent itemset is a frequent itemset which does not have a frequent item superset influence:1 type:2 pair index:181 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach.
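the transformation from a 0-1 quadratic program to a zero-one linear program mentioned in the p4.4 surrounding text above is usually done with the standard linearization below; this is textbook background, not the cited papers' exact construction.

% standard linearization: introduce y_{ij} for each product x_i x_j of binary variables
\[
  \max \;\; \sum_{i} c_i x_i \;+\; \sum_{i<j} q_{ij}\, y_{ij}
  \quad\text{s.t.}\quad
  y_{ij} \le x_i,\;\; y_{ij} \le x_j,\;\; y_{ij} \ge x_i + x_j - 1,\;\;
  x_i,\, y_{ij} \in \{0,1\}
\]
% with binary x_i, x_j these constraints force y_{ij} = x_i x_j, so the quadratic objective becomes linear.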
experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:755 citee title:using association rules for product assortment decisions: a case study citee abstract:it has been claimed that the discovery of association rules is well-suited for applications of market basket analysis to reveal regularities in the purchase behaviour of customers. moreover, recent work indicates that the discovery of interesting rules can in fact only be addressed within a microeconomic framework. this study integrates the discovery of frequent itemsets with a (microeconomic) model for product selection (profset). the model enables the integration of both quantitative and surrounding text:1 item selection related work there are some recent works on the maximal-profit item selection problem. profset [***, 7]<2> models the cross-selling effects by frequent itemsets, which are sets of items co-occurring frequently. a maximal frequent itemset is a frequent itemset which does not have a frequent item superset influence:1 type:2 pair index:182 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:317 citee title:computers and intractability: a guide to the theory of np-completeness citee abstract:it was the very first book on the theory of np-completeness and computational intractability. the book features an appendix providing a thorough compendium of np-complete problems (which was updated in later editions of the book). the book is now outdated in some respects as it does not cover more recent developments such as the pcp theorem. it is nevertheless still in print and is regarded as a classic: in a 2006 study, the citeseer search engine listed the book as the most cited reference in computer science literature. surrounding text:proof sketch: we shall transform the problem of clique to the mpis problem. clique [***]<3> is an np-complete problem defined as follows: clique: given a graph g = (v, e) and a positive integer k, does g contain a clique of size k, i.e., a set of k vertices that are pairwise adjacent? influence:3 type:3 pair index:183 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated.
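To make the decision problem used in the reduction above concrete, here is a toy brute-force clique check in Python; it only illustrates the CLIQUE definition and is not part of the cited NP-hardness proof, and the small graph is hypothetical.

from itertools import combinations

def has_clique(vertices, edges, k):
    # edges is a set of frozensets {u, v}; true iff some k vertices are pairwise adjacent
    return any(
        all(frozenset((u, v)) in edges for u, v in combinations(subset, 2))
        for subset in combinations(vertices, k)
    )

# usage: a triangle plus one pendant vertex contains a clique of size 3 but not 4
v = ["a", "b", "c", "d"]
e = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]}
print(has_clique(v, e, 3))   # True
print(has_clique(v, e, 4))   # False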
the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:735 citee title:mining frequent patterns without candidate generation citee abstract:mining frequent patterns in transaction databases, time-series databases, and many other kinds of databases has been studied popularly in data mining research. most of the previous studies adopt an apriori-like candidate set generation-and-test approach. however, candidate set generation is still costly, especially when there exist a large number of patterns and/or long patterns. in this study, we propose a novel frequent-pattern tree (fp-tree) structure, which is an extended prefix-tree structure for storing compressed, crucial information about frequent patterns, and develop an efficient fp-tree-based mining method, fp-growth, for mining the complete set of frequent patterns by pattern fragment growth. efficiency of mining is achieved with three techniques: (1) a large database is compressed into a condensed, smaller data structure, fp-tree, which avoids costly, repeated database scans, (2) our fp-tree-based mining adopts a pattern-fragment growth method to avoid the costly generation of a large number of candidate sets, and (3) a partitioning-based, divide-and-conquer method is used to decompose the mining task into a set of smaller tasks for mining confined patterns in conditional databases, which dramatically reduces the search space. our performance study shows that the fp-growth method is efficient and scalable for mining both long and short frequent patterns, and is about an order of magnitude faster than the apriori algorithm and also faster than some recently reported new frequent-pattern mining methods. surrounding text:if we actually scan the given database, which typically contains one record for each transaction, the computation will be very costly. here we make use of the fp-tree structure [***]<1>. we construct an fp-tree once for all transactions, setting the support threshold to zero, and recording the occurrence count of itemsets at each tree node
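As a concrete illustration of the construction just described (one FP-tree built over all transactions, support threshold zero, an occurrence count kept at every node), here is a minimal Python sketch; the class, helper names, and the toy transaction database are illustrative assumptions, not the cited FP-growth implementation.

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item label (None for the root)
        self.count = 0            # number of transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions):
    # order items in each transaction by global frequency, as fp-growth does
    freq = Counter(item for t in transactions for item in set(t))
    root = FPNode(None)
    for t in transactions:
        items = sorted(set(t), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1      # with a zero threshold every item of every transaction is counted
    return root

# usage on a toy transaction database (hypothetical data)
db = [["a", "b", "c"], ["a", "c"], ["b", "d"]]
tree = build_fp_tree(db)
print({item: node.count for item, node in tree.children.items()})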
in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:85 citee title:a finite algorithm to maximize certain pseudoconcave functions on polytopes citee abstract:this paper develops and proves an algorithm that finds the exact maximum of certain nonlinear functions on polytopes by performing a finite number of logical and arithmetic operations. permissible objective functions need to be pseudoconcave and allow the closed-form solution of sets of equations, which are first-order conditions associated with the unconstrained, but affinely transformed, objective function. examples are pseudoconcave quadratics and especially the homogeneous function cx + m(x^t v x)^{1/2}, with m < 0 and v positive definite, for which so far no finite algorithm existed. in distinction to most available methods, this algorithm uses the internal representation of the feasible set to selectively decompose it into simplices of varying dimensions; linear programming and a gradient criterion are used to select a sequence of these simplices, which contain a corresponding sequence of strictly increasing, relative and relatively interior maxima, the greatest of which is shown to be the global maximum on the feasible set. to find the interior maxima on these simplices in a finite way, calculus maximizations on the affine hulls of subsets of their vertices are necessary; thus the above requirement that these sets of equations be explicitly solvable. the paper presents a flow structure of the algorithm, its supporting theory, its decision-theoretic use, and an example, computed by an apl version of the method. surrounding text:the first factor is dominant when the selection is below 50% but the second factor becomes dominant when the selection is larger than 50%. the quadratic programming approach (qp) used in the chosen solver uses a variant of the simplex method to determine a feasible region and then uses the methods described in [***]<2> to find the solution. as the approach uses an iterative step based on the current state to determine the next step, the execution time fluctuates considerably, since it is mainly dependent on the problem (or on which state the algorithm is in)
experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:661 citee title:introduction to global optimization citee abstract:accurate modelling of real-world problems often requires nonconvex terms to be introduced in the model, either in the objective function or in the constraints. nonconvex programming is one of the hardest fields of optimization, presenting many challenges in both practical and theoretical aspects. the presence of multiple local minima calls for the application of global optimization techniques. this paper is a mini-course about global optimization techniques in nonconvex programming; it deals with some theoretical aspects of nonlinear programming as well as with some of the current state-of-the-art algorithms in global optimization. the syllabus is as follows. some examples of nonlinear programming problems (nlps). general description of two-phase algorithms. local optimization of nlps: derivation of kkt conditions. short notes about stochastic global multistart algorithms with a concrete example (sobolopt). in-depth study of a deterministic spatial branch-and-bound algorithm, and convex relaxation of an nlp. latest advances in bilinear programming: the theory of reduction constraints. surrounding text:an unconstrained binary quadratic programming problem can be transformed to a binary linear programming problem (zero-one linear programming) [5]<3>. more related properties can be found in [20]<3> and [***]<3>. zero-one linear programming and quadratic programming are known to be np-complete [24]<3> influence:3 type:3 pair index:186 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:756 citee title:quadratic binary programming and dynamical system approach to determine the predictability of epileptic seizures citee abstract:epilepsy is one of the most common disorders of the nervous system. the progressive entrainment between an epileptogenic focus and normal brain areas results to transitions of the brain from chaotic to less chaotic spatiotemporal states, the epileptic seizures. the entrainment between two brain sites can be quantified by the t-index from the measures of chaos (e.g., lyapunov exponents) of the electrical activity (eeg) of the brain. by applying the optimization theory, in particular quadratic zero-one programming, we were able to select the most entrained brain sites 10 minutes before seizures and subsequently follow their entrainment over 2 hours before seizures.
in five patients with 3-24 seizures, we found that over 90% of the seizures are predictable by the optimal selection of electrode sites. this procedure, which is applied to epilepsy research for the first time, shows the possibility of prediction of epileptic seizures well in advance (19.8 to 42.9 minutes) of their occurrence. surrounding text:maximize x^\top q x + c^\top x subject to a x \le b, with x_i = 0 or x_i = 1 for each i. p4.4. any 0-1 quadratic programming problem is polynomially reducible to an unconstrained binary quadratic programming problem [***]<3>. an unconstrained binary quadratic programming problem can be transformed to a binary linear programming problem (zero-one linear programming) [5]<3> influence:3 type:3 pair index:187 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:131 citee title:a microeconomic view of data mining citee abstract:we present a rigorous framework, based on optimization, for evaluating data mining operations such as associations and clustering, in terms of their utility in decision making. this framework leads quickly to some interesting computational problems related to sensitivity analysis, segmentation and the theory of games. surrounding text:the cross-selling effect arises because there can be items that do not generate much profit by themselves but they are the catalysts for the sales of other profitable items. recently, some researchers [***]<1> suggested that association rules can be used in the item selection problem with the consideration of relationships among items. here we follow this line of work in what we consider to be an investigation of the application of data mining in the decision-making process of an enterprise
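The reduction mentioned in the record above, from a constrained 0-1 quadratic program to an unconstrained binary quadratic program, is commonly illustrated with a penalty term; a minimal sketch, assuming a single linear equality constraint for illustration (not the cited construction), is:

\max_{x \in \{0,1\}^n} \; x^{\top} Q x + c^{\top} x \ \text{ subject to } \ a^{\top} x = b \qquad \Longrightarrow \qquad \max_{x \in \{0,1\}^n} \; x^{\top} Q x + c^{\top} x - P \, (a^{\top} x - b)^2

for a sufficiently large penalty constant P > 0. Since x_i^2 = x_i for binary variables, the squared penalty expands into quadratic and linear terms in x, so the result is again an unconstrained binary quadratic objective; inequality constraints would need additional binary slack variables before the same construction applies.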
we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:241 citee title:authoritative sources in a hyperlinked environment citee abstract:the network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. we develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the world wide web. the central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. we propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis. surrounding text:with a strong strength of the link. the hits algorithm [***]<3> is applied and the items with the highest resulting authorities will be the chosen items. it is shown that the result converges to the principal eigenvectors of a matrix defined in terms of the links, confidence values, and profit values influence:3 type:3 pair index:189 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect.
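The hub/authority iteration referenced in the record above (HITS) can be pictured with a short power-iteration sketch in Python; the toy weighted graph below is a hypothetical stand-in, not the cited matrix of links, confidence values, and profit values.

def hits(adj, iterations=50):
    # adj[u][v] is the weight of the directed edge u -> v (absent means weight 0)
    nodes = list(adj)
    hub = {u: 1.0 for u in nodes}
    auth = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # authority of v accumulates the hub scores of nodes pointing to v
        auth = {v: sum(hub[u] * adj[u].get(v, 0.0) for u in nodes) for v in nodes}
        # hub of u accumulates the authority scores of nodes u points to
        hub = {u: sum(adj[u].get(v, 0.0) * auth[v] for v in nodes) for u in nodes}
        # normalize so the scores stay bounded; the iteration converges toward
        # principal eigenvectors of the weighted link matrix products
        a_norm = sum(x * x for x in auth.values()) ** 0.5 or 1.0
        h_norm = sum(x * x for x in hub.values()) ** 0.5 or 1.0
        auth = {v: x / a_norm for v, x in auth.items()}
        hub = {u: x / h_norm for u, x in hub.items()}
    return hub, auth

# usage on a toy weighted item graph (hypothetical data)
graph = {"a": {"b": 0.8, "c": 0.3}, "b": {"c": 0.5}, "c": {}}
print(hits(graph)[1])   # authority scores; the highest-scoring items would be selected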
experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:722 citee title:methods and problems in data mining citee abstract:knowledge discovery in databases and data mining aim at semiautomatic tools for the analysis of large data sets. we consider some methods used in data mining, concentrating on levelwise search for all frequently occurring patterns. we show how this technique can be used in various applications. we also discuss possibilities for compiling data mining queries into algorithms, and look at the use of sampling in data mining. we conclude by listing several open research problems in data mining and surrounding text:the problem is to find all rules with sufficient support and confidence. some of the earlier work includes [22, 4, ***]<2>. 3 influence:2 type:2 pair index:190 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:460 citee title:efficient algorithms for discovering association rules citee abstract:association rules are statements of the form "for 90% of the rows of the relation, if the row has value 1 in the columns in set w, then it has 1 also in column b". agrawal, imielinski, and swami introduced the problem of mining association rules from large collections of data, and gave a method based on successive passes over the database. we give an improved algorithm for the problem. the method is based on careful combinatorial analysis of the information obtained in previous passes; this surrounding text:the problem is to find all rules with sufficient support and confidence. some of the earlier work includes [***, 4, 21]<2>. 3 influence:2 type:2 pair index:191 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect.
we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:757 citee title:optimizing web servers using page rank prefetching for clustered accesses citee abstract:this paper presents a page rank-based prefetching technique for accesses to web page clusters. the approach uses the link structure of a requested page to determine the "most important" linked pages and to identify the page(s) to be prefetched. the underlying premise of our approach is that in the case of cluster accesses, the next pages requested by users of the web server are typically based on the current and previous pages requested. furthermore, if the requested pages have a lot of links to some "important" page, that page has a higher probability of being the next one requested. an experimental evaluation of the prefetching mechanism is presented using real server logs. the results show that the page rank-based scheme does better than random prefetching for clustered accesses, with hit rates of 90% in some cases. surrounding text:hap [26]<2> is a solution to a similar problem. it applies the "hub-authority" profit ranking approach [***]<3> to solve the maximal profit item-selection problem. items are considered as vertices in a graph influence:3 type:3 pair index:192 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:344 citee title:computationally related problems citee abstract:we look at several problems from areas such as network flows, game theory, artificial intelligence, graph theory, integer programming and nonlinear programming and show that they are related in that any one of these problems is solvable in polynomial time if all the others are, too. at present, no polynomial time algorithm for these problems is known. these problems extend the equivalence class of problems known as p-complete. the problem of deciding whether the class of languages accepted by polynomial time nondeterministic turing machines is the same as that accepted by polynomial time deterministic turing machines is related to p-complete problems in that these two classes of languages are the same if each p-complete problem has a polynomial deterministic solution. in view of this, it appears very likely that this equivalence class defines a class of problems that cannot be solved in deterministic polynomial time.
surrounding text:more related properties can be found in [20]<3> and [14]<3>. zero-one linear programming and quadratic programming are known to be np-complete [***]<3>. however, there exist programming tools which can typically return good results within a reasonable time for moderate problem sizes influence:3 type:3 pair index:193 citer id:135 citer title:mpis: maximal-profit item selection with cross-selling considerations citer abstract:in the literature of data mining, many different algorithms for association rule mining have been proposed. however, there is relatively little study on how association rules can aid in more specific targets. in this paper, one of the applications for association rules - maximal-profit item selection with cross-selling effect (mpis) problem - is investigated. the problem is about selecting a subset of items which can give the maximal profit with the consideration of cross-selling. we prove that a simple version of this problem is np-hard. we propose a new approach to the problem with the consideration of the loss rule - a kind of association rule to model the cross-selling effect. we show that the problem can be transformed to a quadratic programming problem. in case quadratic programming is not applicable, we also propose a heuristic approach. experiments are conducted to show that both of the proposed methods are highly effective and efficient citee id:136 citee title:item selection by "hub-authority" profit ranking citee abstract:a fundamental problem in business and other applications is ranking items with respect to some notion of profit based on historical transactions. the difficulty is that the profit of one item not only comes from its own sales, but also from its influence on the sales of other items, i.e., the "cross-selling effect". in this paper, we draw an analogy between this influence and the mutual reinforcement of hub/authority web pages. based on this analogy, we present a novel approach to the item surrounding text:33 times higher than the naive approach for the synthetic data set. in a real drugstore data set, the best previous method hap [***]<2> gives a profitability that is about 2.9 times smaller than mpis alg. 2 problem definition maximal-profit item selection (mpis) is a problem of selecting a subset from a given set of items so that the estimated profit of the resulting selection is maximal among all choices. our definition of the problem is close to [***]<2>. given a data set with. the problem is formulated as 0-1 linear programming that aims to maximize the total profit. however, profset has several drawbacks as pointed out in [***]<2>. more details can be found in [***]<2>. however, profset has several drawbacks as pointed out in [***]<2>. more details can be found in [***]<2>. hap [***]<2> is a solution to a similar problem. more details can be found in [***]<2>. hap [***]<2> is a solution to a similar problem. it applies the "hub-authority" profit ranking approach [23]<3> to solve the maximal profit item-selection problem. <1: of some items for other items. in previous work [***]<2>, the concept of association rules is applied to this task. here we also apply the ideas of association rules for the determination of :. there are some factors that make this algorithm desirable: (1) we utilize the exact formula of the profitability in the iterations. this will steer the result better toward the goal of maximal profits compared to other approaches [***]<2> that do not directly use the formula.
(2) with the "neighborhood" consideration, the item pruning at each iteration usually affects only a minor portion of the set of items and hence introduces only a small amount of computation for an iteration. 7.1 synthetic data set in our experiment, we use the ibm synthetic data generator in [2]<3> to generate the data set with the following parameters (same as the parameters of [***]<2>): 1,000 items, 10,000 transactions, 10 items per transaction on average, and 4 items per frequent itemset on average. the price distribution can be approximated by a lognormal distribution, as pointed out in [15]<3>. the price distribution can be approximated by a lognormal distribution, as pointed out in [15]<3>. we use the same settings as [***]<2>. that is, 10% of items have the low profit range between $0. 15% 7.3 results for synthetic data in the first experiment, we have the same setup as in [***]<2> but the profit follows a lognormal distribution. the result is shown in figure 2 influence:1 type:2 pair index:194 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:on the other hand, as shown in the graph, the prediction accuracy deteriorates once k gets past 4. the explanation here is that a large k may cause the problem of overfitting [***]<3>, i.e.
influence:3 type:1 pair index:195 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:165 citee title:system for spatio-temporal analysis of online news and blogs citee abstract:previous work on spatio-temporal analysis of news items and other documents has largely focused on broad categorization of small text collections by region or country. a system for large-scale spatio-temporal analysis of online news media and blogs is presented, together with an analysis of global news media coverage over a nine year period. we demonstrate the benefits of using a hierarchical geospatial database to disambiguate between geographical named entities, and provide results for an extremely fine-grained analysis of news items. aggregate maps of media attention for particular places around the world are compared with geographical and socio-economic data. our analysis suggests that gdp per capita is the best indicator for media attention surrounding text:they therefore believe that the temporal information of topics may help to forecast spike patterns in sales rank. the other direction focuses on analyzing the contents of blogs [14, 17, 16, ***]<2>. mei et al. integrate a location variable and a theme snapshot for each given time period into the model to analyze spatiotemporal theme patterns from blogs. dalli [***]<2> builds a system for large-scale spatiotemporal analysis of online news and blogs. he utilizes an embedded hierarchical geospatial database to distinguish geographical named entities, and provides results for an extremely fine-grained analysis of items in news content influence:3 type:2 pair index:196 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs.
we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:166 citee title:the liberal media and right-wing conspiracies: using cocitation information to estimate political orientation in web documents citee abstract:this paper introduces a simple method for estimating cultural orientation, the affiliation of online entities in a polarized field of discourse. in particular, cocitation information is used to estimate the political orientation of hypertext documents. a type of cultural orientation, the political orientation of a document is the degree to which it participates in traditionally left- or right-wing beliefs. estimating documents' political orientation is of interest for personalized information retrieval and recommender systems. in its application to politics, the method uses a simple probabilistic model to estimate the strength of association between a document and left- and right-wing communities. the model estimates the likelihood of cocitation between a document of interest and a small number of documents of known orientation. the model is tested on three sets of data: 695 partisan web documents, 162 political weblogs, and 72 non-partisan documents. accuracy above 90% is obtained from the cocitation model, outperforming lexically based classifiers at statistically significant levels surrounding text:there are currently two major research directions on blog analysis. one direction is to make use of links or urls in blogspace [22, 1, 12, ***, 7]<2>. kumar et al. build a time graph for blogspace; by observing its evolving behavior, bursts, defined as large sequences of temporally focused documents with many links between them, can be traced. efron [***]<2> describes a hyperlink-based method to estimate the political orientation of web documents. by estimating the likelihood of cocitation between a document of interest and documents with known orientations, the unknown document is classified into either the left- or the right-wing community influence:3 type:2 pair index:197 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method.
experiments confirm the effectiveness and superiority of the proposed approach citee id:167 citee title:the predictive power of online chatter citee abstract:an increasing fraction of the global discourse is migrating online in the form of blogs, bulletin boards, web pages, wikis, editorials, and a dizzying array of new collaborative technologies. the migration has now proceeded to the point that topics reflecting certain individual products are sufficiently popular to allow targeted online tracking of the ebb and flow of chatter around these topics. based on an analysis of around half a million sales rank values for 2,340 books over a period of four months, and correlating postings in blogs, media, and web pages, we are able to draw several interesting conclusions. first, carefully hand-crafted queries produce matching postings whose volume predicts sales ranks. second, these queries can be automatically generated in many cases. and third, even though sales rank motion might be difficult to predict in general, algorithmic predictors can use online postings to successfully predict spikes in sales rank. surrounding text:we expect the models and algorithms developed for box office prediction to be easily adapted to handle other types of products that are subject to online discussions, such as books, music cds and electronics. prior studies on the predictive power of blogs have used the volume of blogs or link structures to predict the trend of product sales [***, 8]<2>, failing to consider the effect of the sentiments present in the blogs. it has been reported [***, 8]<2> that although there seems to exist a strong correlation between the blog mentions and sales spikes, using the volume or the link structures alone does not provide satisfactory prediction performance. prior studies on the predictive power of blogs have used the volume of blogs or link structures to predict the trend of product sales [***, 8]<2>, failing to consider the effect of the sentiments present in the blogs. it has been reported [***, 8]<2> that although there seems to exist a strong correlation between the blog mentions and sales spikes, using the volume or the link structures alone does not provide satisfactory prediction performance. indeed, as we will illustrate with an example, the sentiments expressed in the blogs are more predictive than volume alone. there are currently two major research directions on blog analysis. one direction is to make use of links or urls in blogspace [22, 1, 12, 5, ***]<2>. kumar et al. build a time graph for blogspace. gruhl et al. [***, 8]<2> show that there is a strong correlation between blog mentions and sales rank. they therefore believe that the temporal information of topics may help to forecast spike patterns in sales rank. characteristics of online discussions: intuitively, a newly released product that evokes a lot of online discussions is likely to have an outstanding sales performance. however, evidence shows that even if there exists a strong correlation between the number of blog mentions of a new product and the sales rank of the product, it could still be very difficult to make a successful prediction of sales ranks based on the number of blog mentions [***]<2>.
to gain a better understanding of the characteristics of online discussions and their predictive power, we investigate the pattern of blog mentions and its relationship to sales data by examining a real example from the movie sector influence:1 type:2 pair index:198 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:168 citee title:information diffusion through blogspace citee abstract:we study the dynamics of information propagation in environments of low-overhead personal publishing, using a large collection of weblogs over time as our example domain. we characterize and model this collection at two levels. first, we present a macroscopic characterization of topic propagation through our corpus, formalizing the notion of long-running "chatter" topics consisting recursively of "spike" topics generated by outside world events, or more rarely, by resonances within the community. second, we present a microscopic characterization of propagation from individual to individual, drawing on the theory of infectious diseases to model the flow. we propose, validate, and employ an algorithm to induce the underlying propagation network from a sequence of posts, and report on the results. surrounding text:we expect the models and algorithms developed for box office prediction to be easily adapted to handle other types of products that are subject to online discussions, such as books, music cds and electronics. prior studies on the predictive power of blogs have used the volume of blogs or link structures to predict the trend of product sales [7, ***]<2>, failing to consider the effect of the sentiments present in the blogs. it has been reported [7, ***]<2> that although there seems to exist a strong correlation between the blog mentions and sales spikes, using the volume or the link structures alone does not provide satisfactory prediction performance. prior studies on the predictive power of blogs have used the volume of blogs or link structures to predict the trend of product sales [7, ***]<2>, failing to consider the effect of the sentiments present in the blogs. it has been reported [7, ***]<2> that although there seems to exist a strong correlation between the blog mentions and sales spikes, using the volume or the link structures alone does not provide satisfactory prediction performance.
indeed, as we will illustrate with an example, the sentiments expressed in the blogs are more predictive than volume alone. gruhl et al. [7, ***]<2> show that there is a strong correlation between blog mentions and sales rank. they therefore believe that the temporal information of topics may help to forecast spike patterns in sales rank influence:1 type:2 pair index:199 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:169 citee title:probabilistic latent semantic analysis citee abstract:probabilistic latent semantic analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. compared to standard latent semantic analysis which stems from linear algebra and performs a singular value decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. this results in a more principled approach which has a solid foundation in statistics. in order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered em. our approach yields substantial and consistent improvements over latent semantic analysis in a number of experiments. surrounding text:in order to model the multifaceted nature of sentiments, we view the sentiments embedded in blogs as an outcome of the joint contribution of a number of hidden factors, and propose a novel approach to sentiment mining based on probabilistic latent semantic analysis (plsa), which we call sentiment plsa (s-plsa). different from the traditional plsa [***]<2>, s-plsa focuses on sentiments rather than topics. therefore, instead of taking a vanilla "bag of words" approach and considering all the words (modulo stop words) present in the blogs, we focus primarily on the words that are sentiment-related. the use of a probabilistic generative model, on the other hand, enables us to deal with sentiment analysis in a principled way. in its traditional form, plsa [***]<2> assumes that there is a set of hidden semantic factors or aspects in the documents, and models the relationship among these factors, documents, and words under a probabilistic framework.
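To make the probabilistic framework just described concrete, here is a minimal Python/numpy sketch of EM for plain PLSA (not the authors' S-PLSA); the factorization over hidden factors z follows the standard formulation, while the toy count matrix, the number of factors, and the iteration budget are illustrative assumptions.

import numpy as np

def plsa(counts, n_factors=2, iterations=50, seed=0):
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # random initialization of p(z|d) and p(w|z), each normalized over its own axis
    p_z_d = rng.random((n_docs, n_factors))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_factors, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(iterations):
        # e-step: p(z|d,w) proportional to p(z|d) * p(w|z)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (d, z, w)
        post /= post.sum(axis=1, keepdims=True) + 1e-12
        # m-step: re-estimate p(w|z) and p(z|d) from expected counts
        expected = counts[:, None, :] * post                   # shape (d, z, w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# usage on a tiny toy count matrix (3 "blog entries" x 4 "sentiment words")
toy = np.array([[3, 1, 0, 0], [0, 0, 2, 4], [1, 0, 1, 2]], dtype=float)
p_z_d, p_w_z = plsa(toy, n_factors=2)
print(np.round(p_z_d, 2))   # per-document mixture over the hidden factors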
with its high flexibility and solid statistical foundations, plsa has been widely used in many areas, including information retrieval, web usage mining, and collaborative filtering influence:1 type:1 pair index:200 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:170 citee title:dynamic, real-time forecasting of online auctions via functional models citee abstract:we propose a dynamic model for forecasting price in online auctions. one of the key features of our model is that it operates during the live auction, which makes it different from previous approaches that only consider static models. our model is also different with respect to how information about price is incorporated. while one part of the model is based on the more traditional notion of an auction's price-level, another part incorporates its dynamics in the form of a price's velocity and acceleration. in that sense, it incorporates key features of a dynamic environment such as an online auction. the use of novel functional data methodology allows us to measure, and subsequently include, dynamic price characteristics. we illustrate our model on a diverse set of ebay auctions across many different book categories. we find significantly higher prediction accuracy compared to standard approaches. surrounding text:we evaluate the prediction performance of the arsa model by applying it to the testing data set. in this paper, we use the mean absolute percentage error (mape) [***]<3> to measure the prediction accuracy: mape = (1/n) \sum_{i=1}^{n} |(\hat{y}_i - y_i) / y_i|
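A small Python helper computing the MAPE measure as reconstructed above; the actual and predicted values are hypothetical.

def mape(actual, predicted):
    # mean absolute percentage error: average of |(predicted - actual) / actual|
    return sum(abs((p - a) / a) for a, p in zip(actual, predicted)) / len(actual)

print(mape([100.0, 200.0, 50.0], [110.0, 190.0, 60.0]))   # approximately 0.117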
we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:171 citee title:words with attitude citee abstract:the traditional notion of word meaning used in natural language processing is literal or lexical meaning as used in dictionaries and lexicons. this relatively objective notion of lexical meaning is different from more subjective notions of emotive or affective meaning. our aim is to come to grips with subjective aspects of meaning expressed in written texts, such as the attitude or value expressed in them. this paper explores how the structure of the wordnet lexical database might be used to assess affective or emotive meaning. in particular, we construct measures based on osgood's semantic differential technique. surrounding text:kamps et al. [***]<2> propose to evaluate the semantic distance from a word to good/bad with wordnet. turney [23]<2> measures the strength of sentiment by the difference between the pointwise mutual information (pmi) of the given phrase with "excellent" and the pmi of the given phrase with "poor" influence:3 type:2 pair index:202 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity, weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public's sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:172 citee title:on the bursty evolution of blogspace citee abstract:we propose two new tools to address the evolution of hyperlinked corpora. first, we define time graphs to extend the traditional notion of an evolving directed graph, capturing link creation as a point phenomenon in time. second, we develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution. we develop these tools in the context of blogspace, the space of weblogs (or blogs). our study involves approximately 750k links among 25k blogs. we create a time graph on these blogs by an automatic analysis of their internal time stamps. we then study the evolution of connected component structure and microscopic community structure in this time graph.
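For reference, the Turney-style measure cited in the record above is usually written as a difference of two pointwise mutual information terms; this is the standard formulation of that measure rather than text taken from the citer:

\mathrm{SO}(\text{phrase}) = \mathrm{PMI}(\text{phrase}, \text{"excellent"}) - \mathrm{PMI}(\text{phrase}, \text{"poor"}), \qquad \mathrm{PMI}(x, y) = \log_2 \frac{p(x, y)}{p(x)\, p(y)}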
we show that blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding over the past year, not just in metrics of scale, but also in metrics of community structure and connectedness. this expansion shows no sign of abating, although measures of connectedness must plateau within two years. by randomizing link destinations in blogspace, but retaining sources and timestamps, we introduce a concept of randomized blogspace . herein, we observe similar evolution of a giant component, but no corresponding increase in community structure. having demonstrated the formation of micro-communities over time, we then turn to the ongoing activity within active communities. we extend recent work of kleinberg to discover dense periods of "bursty" intra-community link creation. surrounding text:there are currently two major research directions on blog analysis. one direction is to make use of links or urls in blogspace [22, 1, ***, 5, 7]<2>. kumar et al. kumar et al. [***]<2> build a time graph for blogspace, and develop views of the graph as a function of time. by observing the evolving behavior of the time graph, burst defined as a large sequence of temporally focused documents with plenty of links between them can be traced influence:3 type:3 pair index:203 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:173 citee title:structure and evolution of blogspace citee abstract:a critical look at more than one million bloggers and the individual entries of some 25,000 blogs reveals blogger demographics, friendships, and activity patterns over time. surrounding text:, food, music, products, politics, etc. ), to highly personal interests [***]<3>. since many bloggers choose to express their opinions online, blogs serve as an excellent indicator of public sentiments and opinions influence:3 type:3 pair index:204 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. 
based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:154 citee title:a probabilistic model for retrospective news event detection citee abstract:retrospective news event detection (red) is defined as the discovery of previously unidentified events in historical news corpus. although both the contents and time information of news articles are helpful to red, most researches focus on the utilization of the contents of news articles. few research works have been carried out on finding better usages of time information. in this paper, we do some explorations on both directions based on the following two characteristics of news articles. on the one hand, news articles are always aroused by events; on the other hand, similar articles reporting the same event often redundantly appear on many news sources. the former hints a generative model of news articles, and the latter provides data enriched environments to perform red. with consideration of these characteristics, we propose a probabilistic model to incorporate both content and time information in a unified framework. this model gives new representations of both news articles and news events. furthermore, based on this approach, we build an interactive red system, hiscovery, which provides additional functions to present events, photo story and chronicle. surrounding text:they therefore believe that the temporal information of topics may help to forecast spike patterns in sales rank. the other direction focuses on analyzing the contents of blogs [***, 17, 16, 3]<2>. mei et al influence:3 type:2 pair index:205 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. 
experiments confirm the effectiveness and superiority of the proposed approach citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly. surrounding text:the choice of using movies rather than other products in our study is mainly due to data availability, in that the daily box office revenue data are all published on the web and readily available, unlike other product sales data which are often private to their respective companies due to obvious reasons. also, as discussed by liu et al$ [***]<3>, analyzing movie reviews is one of the most challenging tasks in sentiment mining. we expect the models and algorithms developed for box office prediction to be easily adapted to handle other types of products that are subject to online discussions, such as books, music cds and electronics influence:3 type:3 pair index:206 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. 
experiments confirm the effectiveness and superiority of the proposed approach citee id:151 citee title:a probabilistic approach to spatiotemporal theme pattern mining on weblogs citee abstract:mining subtopics from weblogs and analyzing their spatiotemporal patterns have applications in multiple domains. in this paper, we define the novel problem of mining spatiotemporal theme patterns from weblogs and propose a novel probabilistic approach to model the subtopic themes and spatiotemporal theme patterns simultaneously. the proposed model discovers spatiotemporal theme patterns by (1) extracting common themes from weblogs; (2) generating theme life cycles for each given location; and (3) generating theme snapshots for each given time period. evolution of patterns can be discovered by comparative analysis of theme life cycles and theme snapshots. experiments on three different data sets show that the proposed approach can discover interesting spatiotemporal theme patterns effectively. the proposed probabilistic model is general and can be used for spatiotemporal text mining on any domain with time and location information. surrounding text:they therefore believe that the temporal information of topics may help to forecast spike patterns in sales rank. the other direction focuses on analyzing the contents of blogs [14, 17, ***, 3]<2>. mei et al influence:3 type:2 pair index:207 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:139 citee title:a mixture model for contextual text mining citee abstract:contextual text mining is concerned with extracting topical themes from a text collection with context information (e.g., time and location) and comparing/analyzing the variations of themes over different contexts. since the topics covered in a document are usually related to the context of the document, analyzing topical themes within context can potentially reveal many interesting theme patterns. in this paper, we generalize some of these models proposed in the previous work and we propose a new general probabilistic model for contextual text mining that can cover several existing models as special cases. specifically, we extend the probabilistic latent semantic analysis (plsa) model by introducing context variables to model the context of a document. 
the proposed mixture model, called contextual probabilistic latent semantic analysis (cplsa) model, can be applied to many interesting mining tasks, such as temporal text mining, spatiotemporal text mining, author-topic analysis, and cross-collection comparative analysis. empirical experiments show that the proposed mixture model can discover themes and their contextual variations effectively surrounding text:they therefore believe that the temporal information of topics may help to forecast spike patterns in sales rank. the other direction focuses on analyzing the contents of blogs [14, ***, 16, 3]<2>. mei et al. mei et al. [***]<2> considers blog as a mixture of unigram language models, with each component corresponding to a distinct subtopic or theme. to analyze spatiotemporal theme patterns from blogs, location variable and theme snapshot for each given time period are integrated in the model influence:3 type:2 pair index:208 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:174 citee title:a sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts citee abstract:sentiment analysis seeks to identify the viewpoint(s) underlying a text span; an example application is classifying a movie review as "thumbs up" or "thumbs down". to determine this sentiment polarity, we propose a novel machine-learning method that applies text-categorization techniques to just the subjective portions of the document. extracting these portions can be implemented using efficient techniques for finding minimum cuts in graphs; this greatly facilitates incorporation of cross-sentence contextual constraints surrounding text:[20]<2> employ three machine learning approaches (naive bayes, maximum entropy, and support vector machine) to label the polarity of imdb movie reviews. in a follow up work, they propose to firstly extract the subjective portion of text with a graph min-cut algorithm, and then feed them into the sentiment classifier [***]<2>. 
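Both the citer's s-plsa and the cplsa model described above extend plsa, which views each document as a mixture of unigram language models. A compact sketch of the plain plsa EM updates is given below for orientation; the function, its variable names, and the toy count matrix are illustrative assumptions, not the authors' code (this dense formulation keeps the full responsibility tensor in memory, so it only suits small examples).

import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    # counts: (n_docs, n_words) term-count matrix; returns P(z|d) and P(w|z)
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_z_d = rng.random((n_docs, n_topics))        # P(z|d)
    p_w_z = rng.random((n_topics, n_words))       # P(w|z)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), shape (n_docs, n_topics, n_words)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate the multinomials from expected counts n(d,w) * P(z|d,w)
        weighted = counts[:, None, :] * resp
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z

toy_counts = np.array([[2, 1, 0, 0], [0, 0, 3, 1], [1, 1, 1, 1]])
theta, phi = plsa(toy_counts, n_topics=2)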
instead of applying the straightforward frequency-based bag-of-words feature selection methods, whitelaw et al influence:3 type:2 pair index:209 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:117 citee title:seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales citee abstract:we address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). this task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". we first evaluate human performance at the task. then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. we show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of svms when we employ a novel similarity measure appropriate to the problem. surrounding text:pushing further from the explicit two-class classification problem, pang et al. [***]<2> and zhang [25]<2> attempt to determine the author��s opinion with different rating scales (i$e$, the number of stars). liu et al influence:3 type:2 pair index:210 citer id:163 citer title:a sentiment-aware model for predicting sales citer abstract:due to its high popularity,weblogs (or blogs in short) present a wealth of information that can be very helpful in assessing the general public��s sentiments and opinions. in this paper, we study the problem of mining sentiment information from blogs and investigate ways to use such information for predicting product sales performance. based on an analysis of the complex nature of sentiments, we propose sentiment plsa (s-plsa), in which a blog entry is viewed as a document generated by a number of hidden sentiment factors. training an s-plsa model on the blog data enables us to obtain a succinct summary of the sentiment information embedded in the blogs. 
we then present arsa, an autoregressive sentiment-aware model, to utilize the sentiment information captured by s-plsa for predicting product sales performance. extensive experiments were conducted on a movie data set. we compare arsa with alternative models that do not take into account the sentiment information, as well as a model with a different feature selection method. experiments confirm the effectiveness and superiority of the proposed approach citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:pang et al. [***]<2> employ three machine learning approaches (naive bayes, maximum entropy, and support vector machine) to label the polarity of imdb movie reviews. in a follow up work, they propose to firstly extract the subjective portion of text with a graph min-cut algorithm, and then feed them into the sentiment classifier [18]<2> influence:3 type:2 pair index:211 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:193 citee title:a variational bayesian framework for graphical models citee abstract:this paper presents a novel practical framework for bayesian modelaveraging and model selection in probabilistic graphical modelsfiour approach approximates full posterior distributions over modelparameters and structures, as well as latent variables, in an analyticalmannerfi these posteriors fall out of a free-form optimizationprocedure, which naturally incorporates conjugate priorsfi unlikein large sample approximations, the posteriors are generally nongaussianand no hessian needs surrounding text:3 and consider a fuller bayesian approach to lda. we consider a variational approach to bayesian inference that places a separable distribution on the random variables b, q, and z (attias, 2000)[***]<1>: q(b1:k. z1:m influence:2 type:1 pair index:212 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. 
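As an illustration of the bag-of-words polarity classification setup described in the surrounding text of pair 210 (a naive Bayes classifier over labeled movie reviews), the sketch below uses scikit-learn as a convenient stand-in; the reviews and labels are invented placeholders, and the original work also evaluated maximum entropy and SVM classifiers.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# hypothetical labeled reviews (1 = positive, 0 = negative)
reviews = ["a gripping and moving film", "dull plot and wooden acting",
           "brilliant performances throughout", "a tedious, forgettable mess"]
labels = [1, 0, 1, 0]

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(reviews, labels)
print(clf.predict(["moving and brilliant", "tedious and dull"]))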
lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:245 citee title:modern information retrieval citee abstract:information retrieval (ir) has changed considerably in the last years with the expansion of the web (world wide web) and the advent of modern and inexpensive graphical user interfaces and mass storage devices. as a result, traditional ir textbooks have become quite out-of-date which has led to the introduction of new ir books recently. nevertheless, we believe that there is still great need of a book that approaches the field in a rigorous and complete way from a computer-science perspective (in opposition to a user-centered perspective). this book is an effort to partially fulfill this gap and should be useful for a first course on information retrieval as well as for a graduate course on the topic. these www pages are not a digital version of the book, nor the complete contents of it. here you will find the preface, table of contents, glossary and two chapters available for reading on-line. the printed version can be ordered directly from addison-wesley-longman. surrounding text:the goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships that are useful for basic tasks such as classification, novelty detection, summarization, and similarity and relevance judgments. significant progress has been made on this problem by researchers in the field of information retrieval (ir) (baeza-yates and ribeiro-neto, 1999)[***]<1>. the basic methodology proposed by ir researchers for text corpora—a methodology successfully deployed in modern internet search engines—reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts influence:2 type:3 pair index:213 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. 
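The tf-idf reduction referred to in the surrounding text of pair 212, where each document becomes a vector of weighted count ratios, can be sketched in a few lines; scikit-learn is used here purely as a stand-in and the two documents are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["latent topics in text corpora", "text corpora and document structure"]
X = TfidfVectorizer().fit_transform(docs)   # documents-by-terms tf-idf matrix
print(X.shape)
print(X.toarray().round(2))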
we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:682 citee title:modeling annotated data citee abstract:we consider the problem of modeling annotated datadata with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is e ective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval surrounding text:as a probabilistic module, lda can be readily embedded in a more complex model�a property that is not possessed by lsi. in recent work we have used pairs of lda modules to model relationships between images and their corresponding descriptive captions (blei and jordan, 2002)[***]<1>. moreover, there are numerous possible extensions of lda influence:2 type:2 pair index:214 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:683 citee title:theory of probability citee abstract:probability theory is the branch of mathematics concerned with analysis of random phenomena. the central objects of probability theory are random variables, stochastic processes, and events: mathematical ions of non-deterministic events or measured quantities that may either be single occurrences or evolve over time in an apparently random fashion. although an individual coin toss or the roll of a die is a random event, if repeated many times the sequence of random events will exhibit certain statistical patterns, which can be studied and predicted. two representative mathematical results describing such patterns are the law of large numbers and the central limit theorem. as a mathematical foundation for statistics, probability theory is essential to many human activities that involve quantitative analysis of large sets of data. methods of probability theory also apply to description of complex systems given only partial knowledge of their state, as in statistical mechanics. a great discovery of twentieth century physics was the probabilistic nature of physical phenomena at atomic scales, described in quantum mechanics. surrounding text:the specific ordering of the documents in a corpus can also be neglected. 
a classic representation theorem due to de finetti (1990)[***]<3> establishes that any collection of exchangeable random variables has a representation as a mixture distribution—in general an infinite mixture. thus, if we wish to consider exchangeable representations for documents and words, we need to consider mixture models that capture the exchangeability of both words and documents influence:2 type:3 pair index:215 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:655 citee title:indexing by latent semantic analysis citee abstract:a new method for automatic indexing and retrieval is described. the approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure� in order to improve the detection of relevant documents on the basis of terms found in queries. the particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 or- thogonal factors from which the original matrix can be approximated by linear combination. documents are represented by ca. 100 item vectors of factor weights. queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are re- turned. initial tests find this completely automatic method for retrieval to be promising. surrounding text:while the tf-idf reduction has some appealing features—notably in its basic identification of sets of words that are discriminative for documents in the collection—the approach also provides a relatively small amount of reduction in description length and reveals little in the way of inter- or intradocument statistical structure. to address these shortcomings, ir researchers have proposed several other dimensionality reduction techniques, most notably latent semantic indexing (lsi) (deerwester et al$, 1990)[***]<1>. lsi uses a singular value decomposition of the x matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection influence:3 type:1 pair index:216 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. 
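The LSI step described in the surrounding text of pair 215, a singular value decomposition of the term-document matrix X followed by projection onto the leading singular directions, can be sketched as follows; the random matrix and the choice k = 2 are illustrative only.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 20))                    # toy tf-idf matrix: 6 documents x 20 terms
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                      # number of latent dimensions retained
X_lsi = U[:, :k] * s[:k]                   # documents projected into the k-dimensional LSI subspace
print(X_lsi.shape)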
we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:277 citee title:bayesian methods for censored categorical data citee abstract:bayesian methods are given for finite-category sampling when some of the observations suffer missing category distinctions. dickey's (1983) generalization of the dirichlet family of prior distributions is found to be closed under such censored sampling. the posterior moments and predictive probabilities are proportional to ratios of b. c. carlson's multiple hypergeometric functions. closed-form expressions are developed for the case of nested reported sets, when bayesian estimates can be computed easily from relative frequencies. effective computational methods are also given in the general case. an example involving surveys of death-penalty attitudes is used throughout to illustrate the theory. a simple special case of categorical missing data is a two-way contingency table with cross-classified count data x_ij (i = 1,..., r; j = 1,..., c), together with supplementary trials counted only in the margin distinguishing the rows, y_i (i = 1,..., r). there could also be further supplementary trials reported only by counts distinguishing the columns, z_j (j = 1,..., c). under assumptions that the censoring process itself is "noninformative" regarding the category probabilities θ_ij (e.g., the report for each possible outcome might be nonrandom and prespecified), the bayesian inference regarding the θ_ij's would be based on the likelihood function (∏_ij θ_ij^(x_ij)). such a likelihood is ordinarily considered intractable and unsuited for bayesian conjugate prior inference. we develop a bayesian conjugate theory, however, by recognizing the complete integrals of such functions as carlson functions and the posterior distributions resulting from dirichlet prior distributions as known generalized dirichlet distributions. the corresponding posterior density functions are similar in form to the likelihood, and these constitute a family of distributions closed under sampling and tractable in various senses, including the convenient computability of moments and modes. surrounding text:it has been used in a bayesian context for censored discrete data to represent the posterior on q which, in that setting, is a random parameter (dickey et al., 1987)[***]<1>. although the posterior distribution is intractable for exact inference, a wide variety of approximate inference algorithms can be considered for lda, including laplace approximation, variational approximation, and markov chain monte carlo (jordan, 1999)[19]<1>. in this section we describe a simple convexity-based variational algorithm for inference in lda, and discuss some of the alternatives in section 8 influence:3 type:2 pair index:217 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document.
we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:274 citee title:bayesian data analysis citee abstract:incorporating new and updated information, this second edition of the bestselling text in bayesian data analysis continues to emphasize practice over theory, describing how to conceptualize, perform, and critiques statistical analysis from a bayesian perspective. changes in the new edition include: added material on how bayesian methods are connected to other approaches, stronger focus on mcmc, added chapter on further computation topics, more examples, and additional chapters on current models for bayesian data analysis such as equation models, generalized linear mixed models, and more. the book is an introductory text and a reference for working scientists throughout their professional life. surrounding text:structures similar to that shown in figure 1 are often studied in bayesian statistical modeling, where they are referred to as hierarchical models (gelman et al. , 1995)[***]<1>, or more precisely as conditionally independent hierarchical models (kass and steffey, 1989)[21]<1>. such models are also often referred to as parametric empirical bayes models, a term that refers not only to a particular model structure, but also to the methods used for estimating parameters in the model (morris, 1983)[25]<1> influence:2 type:2 pair index:218 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:150 citee title:a probabilistic approach to semantic representation citee abstract:: semantic networks produced from human data havestatistical properties that cannot be easily capturedby spatial representationsfi we explore a probabilisticapproach to semantic representation that explicitlymodels the probability with which words occurin di#erent contexts, and hence captures the probabilisticrelationships between wordsfi we show thatthis representation has statistical properties consistentwith the large-scale structure of semantic networksconstructed by humans, surrounding text:it is also possible to achieve higher accuracy by dispensing with the requirement of maintaining a bound, and indeed minka and lafferty (2002)[24]<2> have shown that improved inferential accuracy can be obtained for the lda model via a higher-order variational technique known as expectation propagation. finally, griffiths and steyvers (2002)[***]<2> have presented a markov chain monte carlo algorithm for lda. 
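The Markov chain Monte Carlo algorithm of Griffiths and Steyvers mentioned in the surrounding text of pair 218 is usually presented as a collapsed Gibbs sampler; the sketch below is a reconstruction for illustration, with symmetric priors alpha and eta and toy documents as assumptions rather than details from the citer paper.

import numpy as np

def lda_gibbs(docs, n_topics, n_words, n_iter=200, alpha=0.1, eta=0.01, seed=0):
    # docs: list of lists of word ids; returns doc-topic and topic-word count matrices
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), n_topics))      # document-topic counts
    nkw = np.zeros((n_topics, n_words))        # topic-word counts
    nk = np.zeros(n_topics)                    # per-topic totals
    z = [rng.integers(n_topics, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):             # initialize counts from the random assignment
        for n, w in enumerate(doc):
            k = z[d][n]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]                    # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # full conditional p(z = k | everything else), up to a constant
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + n_words * eta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][n] = k                    # resample and restore the counts
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return ndk, nkw

toy_docs = [[0, 1, 2, 1], [2, 3, 3, 0], [0, 0, 1, 3]]
ndk, nkw = lda_gibbs(toy_docs, n_topics=2, n_words=4, n_iter=50)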
lda is a simple model, and although we view it as a competitor to methods such as lsi and plsi in the setting of dimensionality reduction for document collections and other discrete corpora, it is also intended to be illustrative of the way in which probabilistic models can be scaled up to provide useful inferential machinery in domains involving multiple levels of structure influence:1 type:2 pair index:219 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:684 citee title:overview of the first text retrieval conference (trec-1) citee abstract:the first text retrieval conference (trec-1) was held in early november 1992 and was attended by about 100 people working in the 25 participating groups. the goal of the conference was to bring research groups together to discuss their work on a new large test collection. there was a large variety of retrieval techniques reported on, including methods using automatic thesaurii, sophisticated term weighting, natural language techniques, relevance feedback, and advanced pattern matching. as results had been run through a common evaluation package, groups were able to compare the effectiveness of different techniques, and discuss how differences among the sytems affected performance surrounding text:example in this section, we provide an illustrative example of the use of an lda model on real data. our data are 16,000 documents from a subset of the trec ap corpus (harman, 1992)[***]<3>. after removing a standard list of stop words, we used the em algorithm described in section 5 influence:3 type:3 pair index:220 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:223 citee title:an experimental comparison of several clustering and initialization methods citee abstract:we examine methods for clustering in high dimensions. 
in the first part of the paper, we perform an experimental comparison between three batch clustering algorithms: the expectation--maximization (em) algorithm, a "winner take all" version of the em algorithm reminiscent of the k-means algorithm, and model-based hierarchical agglomerative clustering. we learn naive-bayes models with a hidden root node, using highdimensional discrete-variable data sets (both real and synthetic). surrounding text:in our experiments, we initialize em by seeding each conditional multinomial distribution with five documents, reducing their effective total length to two words, and smoothing across the whole vocabulary. this is essentially an approximation to the scheme described in heckerman and meila (2001)[***]<3>. 7 influence:3 type:3 pair index:221 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:427 citee title:probabilistic latent semantic indexing citee abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain{specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with di erent dimensionalities has proven to be advantageous surrounding text:given a generative model of text, however, it is not clear why one should adopt the lsi methodology—one can attempt to proceed more directly, fitting the model to data using maximum likelihood or bayesian methods. a significant step forward in this regard was made by hofmann (1999)[***]<1>, who presented the probabilistic lsi (plsi) model, also known as the aspect model, as an alternative to lsi. the plsi approach, which we describe in detail in section 4 influence:2 type:1 pair index:222 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. 
in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:685 citee title:statistical methods for speech recognition citee abstract:this book reflects decades of important research on the mathematical foundations of speech recognition. it focuses on underlying statistical techniques such as hidden markov models, decision trees, the expectation-maximization algorithm, information theoretic goodness criteria, maximum entropy probability estimation, parameter and data clustering, and smoothing of probability distributions. the author's goal is to present these principles clearly in the simplest setting, to show the advantages of self-organization from real data, and to enable the reader to apply the techniques surrounding text:maximum likelihood estimates of the multinomial parameters assign zero probability to such words, and thus zero probability to new documents. the standard approach to coping with this problem is to “smooth�the multinomial parameters, assigning positive probability to all vocabulary items whether or not they are observed in the training set (jelinek, 1997)[***]<1>. laplace smoothing is commonly used influence:3 type:3 pair index:223 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:632 citee title:making large-scale svm learning practical citee abstract:training a support vector machine (svm) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. despite the fact that this type of problem is well understood, there are many issues to be considered in designing an svm learner. in particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. svmlight is an surrounding text:a challenging aspect of the document classification problem is the choice of features. treating individual words as features yields a rich but very large feature set (joachims, 1999)[***]<2>. one way to reduce this feature set is to use an lda model for dimensionality reduction. we then trained a support vector machine (svm) on the low-dimensional representations provided by lda and compared this svm to an svm trained on all the word features. 
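The Laplace smoothing mentioned in the surrounding text of pair 222 amounts to adding one pseudo-count per vocabulary item before normalizing; a one-screen sketch over a hypothetical count vector:

import numpy as np

counts = np.array([5, 0, 2, 0, 1])                         # hypothetical word counts
p_ml = counts / counts.sum()                               # maximum likelihood: zeros for unseen words
p_laplace = (counts + 1) / (counts.sum() + counts.size)    # add-one smoothing: strictly positive
print(p_ml)
print(p_laplace)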
using the svmlight software package (joachims, 1999)[***]<2>, we compared an svm trained on all the word features with those trained on features induced by a 50-topic lda model. note that we reduce the feature space by 99 influence:3 type:3 pair index:224 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:686 citee title:learning in graphical models citee abstract:graphical models, a marriage between probability theory and graph theory, provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering--uncertainty and complexity. in particular, they play an increasingly important role in the design and analysis of machine learning algorithms. fundamental to the idea of a graphical model is the notion of modularity: a complex system is built by combining simpler parts. probability theory serves as the glue whereby the parts are combined, ensuring that the system as a whole is consistent and providing ways to interface models to data. graph theory provides both an intuitively appealing interface by which humans can model highly interacting sets of variables and a data structure that lends itself naturally to the design of efficient general-purpose algorithms. this book presents an in-depth exploration of issues related to learning within the graphical model formalism. four chapters are tutorial chapters--robert cowell on inference for bayesian networks, david mackay on monte carlo methods, michael i. jordan et al. on variational methods, and david heckerman on learning with bayesian networks. the remaining chapters cover a wide range of topics of current research interest. surrounding text:, 1987)[11]<1>. although the posterior distribution is intractable for exact inference, a wide variety of approximate inference algorithms can be considered for lda, including laplace approximation, variational approximation, and markov chain monte carlo (jordan, 1999)[***]<1>. in this section we describe a simple convexity-based variational algorithm for inference in lda, and discuss some of the alternatives in section 8 influence:2 type:2 pair index:225 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. 
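The dimensionality-reduction experiment described in the surrounding text of pair 223 (reduce documents to topic proportions with a 50-topic lda model, then train a linear svm) can be sketched as a pipeline. The text uses the svmlight package; the sketch below substitutes scikit-learn's LatentDirichletAllocation (an online variational implementation) and LinearSVC purely for illustration, with placeholder documents, labels, and a small number of topics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["grain prices rose sharply", "the senate passed the budget bill",
        "corn and wheat exports fell", "lawmakers debated the new tax plan"]
labels = [0, 1, 0, 1]                        # hypothetical class labels

clf = make_pipeline(CountVectorizer(),
                    LatentDirichletAllocation(n_components=2, random_state=0),  # 50 topics in the text
                    LinearSVC())
clf.fit(docs, labels)
print(clf.predict(["wheat prices fell", "the budget bill passed"]))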
we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:229 citee title:an introduction to variational methods for graphical models citee abstract:this paper presents a tutorial introduction to the use of variational methods for inference and learningin graphical models (bayesian networks and markov random fields). we present a number of examples of graphicalmodels, including the qmr-dt database, the sigmoid belief network, the boltzmann machine, and several variantsof hidden markov models, in which it is infeasible to run exact inference algorithms. we then introduce variationalmethods, which exploit laws of large numbers to transform the original graphical model into a simplified graphicalmodel in which inference is efficient. inference in the simpified model provides bounds on probabilities of interestin the original model. we describe a general framework for generating variational transformations based on convexduality. finally we return to the examples and demonstrate how variational algorithms can be formulated in eachcase. surrounding text:2 variational inference the basic idea of convexity-based variational inference is to make use of jensen’s inequality to obtain an adjustable lower bound on the log likelihood (jordan et al. , 1999)[***]<1>. essentially, one considers a family of lower bounds, indexed by a set of variational parameters influence:3 type:1 pair index:226 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:581 citee title:general lower bounds based on computer generated higher order expansions citee abstract:in this article we show the rough outline of a computer algorithm to generate lower bounds on the exponential function of (in principle) arbitrary precisionfi we implemented this to generate all necessary analytic terms for the boltzmann machine partition function thus leading to lower bounds of any orderfi it turns out that the extra variational parameters can be optimized analyticallyfi we show that bounds upto nineth order are still reasonably calculable in practical situationsfi the generated surrounding text:other approaches that might be considered include laplace approximation, higher-order variational techniques, and monte carlo methods. in particular, leisink and kappen (2002)[***]<2> have presented a general methodology for converting low-order variational lower bounds into higher-order variational bounds. 
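The Jensen-inequality bound referred to in the surrounding text of pair 225 can be stated generically as follows (a standard restatement, with q denoting the variational distribution over the latent variables and the right-hand side maximized over its free parameters):

\log p(w \mid \alpha, \beta)
\;\ge\;
\mathbb{E}_{q}\big[\log p(\theta, z, w \mid \alpha, \beta)\big]
\;-\;
\mathbb{E}_{q}\big[\log q(\theta, z)\big]

The gap between the two sides is the KL divergence between q and the true posterior, so tightening the bound over the variational parameters is equivalent to finding the member of the variational family closest to the posterior.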
it is also possible to achieve higher accuracy by dispensing with the requirement of maintaining a bound, and indeed minka and lafferty (2002)[24]<2> have shown that improved inferential accuracy can be obtained for the lda model via a higher-order variational technique known as expectation propagation influence:2 type:2 pair index:227 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:533 citee title:expectation-propagation for the generative aspect model citee abstract:the generative aspect model is an extension of the multinomial model for text that allows word probabilities to vary stochastically across documents surrounding text:in particular, leisink and kappen (2002)[22]<2> have presented a general methodology for converting low-order variational lower bounds into higher-order variational bounds. it is also possible to achieve higher accuracy by dispensing with the requirement of maintaining a bound, and indeed minka and lafferty (2002)[***]<2> have shown that improved inferential accuracy can be obtained for the lda model via a higher-order variational technique known as expectation propagation. finally, griffiths and steyvers (2002)[13]<2> have presented a markov chain monte carlo algorithm for lda influence:2 type:2 pair index:228 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. 
we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:687 citee title:using maximum entropy for text classification citee abstract:this paper proposes the use of maximum entropy techniques for text classification. maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. the underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. constraints on the distribution, derived from labeled training data, inform maximum entropy surrounding text:nigam et al., 1999)[***]<2>. in fact, by placing a dirichlet prior on the multinomial parameter we obtain an intractable posterior in the mixture model setting, for much the same reason that one obtains an intractable posterior in the basic lda model influence:3 type:2 pair index:229 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:688 citee title:text classification from labeled and unlabeled documents using em citee abstract:this paper shows that the accuracy of learned text classifiers can be improved by augmenting a small number of labeled training documents with a large pool of unlabeled documents. this is important because in many text classification problems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. we introduce an algorithm for learning from labeled and unlabeled documents based on the combination of expectation-maximization (em) and a naive bayes classifier. the algorithm first trains a classifier using the available labeled documents, and probabilistically labels the unlabeled documents. it then trains a new classifier using the labels for all the documents, and iterates to convergence. this basic em procedure works well when the data conform to the generative assumptions of the model. however, these assumptions are often violated in practice, and poor performance can result. we present two extensions to the algorithm that improve classification accuracy under these conditions: (1) a weighting factor to modulate the contribution of the unlabeled data, and (2) the use of multiple mixture components per class. experimental results, obtained using text from three different real-world tasks, show that the use of unlabeled data reduces classification error by up to 30%. surrounding text:2 mixture of unigrams if we augment the unigram model with a discrete random topic variable z (figure 3b), we obtain a mixture of unigrams model (nigam et al., 2000)[***]<2>.
under this mixture model, each document is generated by first choosing a topic z and then generating n words independently from the conditional multinomial p(w|z) influence:3 type:2 pair index:230 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:689 citee title:latent semantic indexing: a probabilistic analysis citee abstract:latent semantic indexing (lsi) is an information retrieval technique based on the spectral analysis of the term-document matrix, whose empirical success had heretofore been without rigorous prediction and explanation. we prove that, under certain conditions, lsi does succeed in capturing the underlying semantics of the corpus and achieves improved retrieval performance. we also propose the technique of random projection as a way of speeding up lsi. we complement our theorems with surrounding text:to substantiate the claims regarding lsi, and to study its relative strengths and weaknesses, it is useful to develop a generative probabilistic model of text corpora and to study the ability of lsi to recover aspects of the generative model from data (papadimitriou et al., 1998)[***]<1>. given a generative model of text, however, it is not clear why one should adopt the lsi methodology—one can attempt to proceed more directly, fitting the model to data using maximum likelihood or bayesian methods influence:2 type:1 pair index:231 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation.
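The mixture-of-unigrams generative process described above (choose a single topic z for the document, then draw n words i.i.d. from p(w|z)) can be sketched directly; the topic prior and word distributions below are hypothetical toy values used only to illustrate the process.

import random

# toy p(z) and p(w|z); the numbers are hypothetical and only illustrate the process
topic_priors = {"sports": 0.5, "politics": 0.5}
word_dists = {
    "sports": {"game": 0.5, "team": 0.3, "vote": 0.2},
    "politics": {"vote": 0.6, "law": 0.3, "game": 0.1},
}

def generate_document(n_words):
    # choose one topic z for the whole document, then draw n i.i.d. words from p(w|z)
    z = random.choices(list(topic_priors), weights=list(topic_priors.values()))[0]
    words = random.choices(list(word_dists[z]), weights=list(word_dists[z].values()), k=n_words)
    return z, words

print(generate_document(8))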
we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:690 citee title:probabilistic models for unified collaborative and content-based recommendation in sparse-data environments citee abstract:recommender systems leverage product and community information to target products to consumers. researchers have developed collaborative recommenders, content-based recommenders, and a few hybrid systems. we propose a unified probabilistic framework for merging collaborative and content-based recommendations. we extend hofmann's aspect model to incorporate three-way co-occurrence data among users, items, and item content. the relative influence of collaboration data versus content data is not surrounding text:it has been shown, however, that overfitting can occur even when tempering is used (popescul et al., 2001)[***]<2>. lda overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable rather than a large set of individual parameters which are explicitly linked to the training set influence:3 type:2 pair index:232 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:643 citee title:improving multi-class text classification with naive bayes citee abstract:there are numerous text documents available in electronic form. more and more are becoming available every day. such documents represent a massive amount of information that is easily accessible. seeking value in this huge collection requires organization; much of the work of organizing documents can be automated through text classification. the accuracy and our understanding of such systems greatly influences their usefulness. in this paper, we seek 1) to advance the understanding of commonly surrounding text:in the mixture of unigrams model, overfitting is a result of peaked posteriors in the training set, a phenomenon familiar in the supervised setting, where this model is known as the naive bayes model (rennie, 2001)[***]<2>. this leads to a nearly deterministic clustering of the training documents (in the e-step) which is used to determine the word probabilities in each mixture component (in the m-step) influence:3 type:2 pair index:233 citer id:164 citer title:latent dirichlet allocation citer abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics.
each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model citee id:566 citee title:introduction to modern information retrieval citee abstract:new technology now allows the design of sophisticated information retrieval systems that can not only analyze, process and store, but can also retrieve specific resources matching a particular user’s needs. this clear and practical text relates the theory, techniques and tools critical to making information retrieval work. a completely revised second edition incorporates the latest developments in this rapidly expanding field, including multimedia information retrieval, user interfaces and digital libraries. chowdhury’s coverage is comprehensive, including classification, cataloging, subject indexing, abstracting, vocabulary control; cd-rom and online information retrieval; multimedia, hypertext and hypermedia; expert systems and natural language processing; user interface systems; internet, world wide web and digital library environments. illustrated with many examples and comprehensively referenced for an international audience, this is an ideal textbook for students of library and information studies and those professionals eager to advance their knowledge of the future of information. surrounding text:the basic methodology proposed by ir researchers for text corpora—a methodology successfully deployed in modern internet search engines—reduces each document in the corpus to a vector of real numbers, each of which represents ratios of counts. in the popular tf-idf scheme (salton and mcgill, 1983)[***]<1>, a basic vocabulary of “words” or “terms” is chosen, and, for each document in the corpus, a count is formed of the number of occurrences of each word. after suitable normalization, this term frequency count is compared to an inverse document frequency count, which measures the number of occurrences of a influence:2 type:3 pair index:234 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:208 citee title:an argumentation analysis of weblog conversations citee abstract:weblogs are important new components of the internet. they provide individual users with an easy way to publish online and others to comment on these views. furthermore, there is a suite of secondary applications that allow weblogs to be linked, searched, and navigated.
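The tf-idf weighting described in the surrounding text above can be illustrated as follows. The exact normalization varies across formulations, so this sketch uses one common variant (normalized term frequency multiplied by the log inverse document frequency) rather than the specific scheme of Salton and McGill; the toy documents are hypothetical.

import math
from collections import Counter

def tfidf_vectors(corpus):
    # document frequency: number of documents containing each term
    df = Counter(term for doc in corpus for term in set(doc.split()))
    n = len(corpus)
    vectors = []
    for doc in corpus:
        tf = Counter(doc.split())
        total = sum(tf.values())
        # normalized term frequency times log inverse document frequency
        vectors.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = ["privacy preserving data mining",
        "topic models for text data",
        "mining text collections"]
print(tfidf_vectors(docs))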
although originally intended for individual use, in practice weblogs increasingly appear to facilitate distributed conversations. this could have important implications for the use of this technology as a medium for collaboration. given the special characteristics of weblogs and their supporting applications, they may be well suited for a range of conversational purposes that require different forms of argumentation. in this paper, we analyze the argumentation potential of weblog technologies, using a diagnostic framework for argumentation technologies. we pay special attention to the conversation structures and dynamics that weblogs naturally afford. based on this initial analysis, we make a number of recommendations for research on how to apply these technologies in purposeful conversation processes such as for knowledge management. surrounding text:krishnamurthy studies the posting patterns to a specific weblog following the september 11 events, finding that insightful posts attract the largest number of comments [7]<2>. de moor and efimova [***]<2> discuss weblog comments in a larger context of weblog conversations. among their findings is user frustration about the fragmentation of discussions between various weblog posts and associated comments, indicating that for users the comments are an inherent part of the weblog text, and they wish to access them as such influence:1 type:2 pair index:235 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:209 citee title:blogpulse: automated trend discovery for weblogs citee abstract:over the past few years, weblogs have emerged as a new communication and publication medium on the internet. in this paper, we describe the application of data mining, information extraction and nlp algorithms for discovering trends across our subset of approximately 100,000 weblogs. we publish daily lists of key persons, key phrases, and key paragraphs to a public web site, blogpulse.com. in addition, we maintain a searchable index of weblog entries. on top of the search index, we have implemented trend search, which graphs the normalized trend line over time for a search query and provides a way to estimate the relative buzz of word of mouth for given topics over time. surrounding text:1. collect all weblog posts in the blogpulse [***]<1> index from the given period containing a permalink. 2 influence:3 type:3 pair index:236 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. 
using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:210 citee title:bridging the gap: a genre analysis of weblogs citee abstract:weblogs (blogs)—frequently modified web pages in which dated entries are listed in reverse chronological sequence—are the latest genre of internet communication to attain widespread popularity, yet their characteristics have not been systematically described. this paper presents the results of a content analysis of 203 randomly-selected weblogs, comparing the empirically observable features of the corpus with popular claims about the nature of weblogs, and finding them to differ in a number of respects. notably, blog authors, journalists and scholars alike exaggerate the extent to which blogs are interlinked, interactive, and oriented towards external events, and underestimate the importance of blogs as individualistic, intimate forms of self-expression. based on the profile generated by the empirical analysis, we consider the likely antecedents of the blog genre, situate it with respect to the dominant forms of digital communication on the internet today, and advance predictions about its long-term impacts. surrounding text:a single exception to this is the work of herring et. al [***]<2>, which examine a random sample of 203 weblogs. in this sample, a relatively small amount of comments is found (average of 0. reports on the amount of weblogs permitting comments are mixed. a low figure of 43% appears in the random sample examined in [***]<2>, while the community-related sample studied in [20]<2> shows that more than 90% of the weblogs enabled comments (both studies do not report on the actual number of commented weblogs). in our collection, a random sample of 500 weblogs shows that over 80% of weblogs allow users to add comments to the posts, but only 28% of weblogs actually had comments posted. in our collection, a random sample of 500 weblogs shows that over 80% of weblogs allow users to add comments to the posts, but only 28% of weblogs actually had comments posted. 2 the increase in comment prevalence, compared to [***]<2>, can be attributed to the development of blogging software in the 2. 5-year period between the two studies influence:1 type:2 pair index:237 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:172 citee title:on the bursty evolution of blogspace citee abstract:we propose two new tools to address the evolution of hyperlinked corpora. first, we define time graphs to extend the traditional notion of an evolving directed graph, capturing link creation as a point phenomenon in time. 
second, we develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution. we develop these tools in the context of blogspace , the space of weblogs (or blogs). our study involves approximately 750k links among 25k blogs. we create a time graph on these blogs by an automatic analysis of their internal time stamps. we then study the evolution of connected component structure and microscopic community structure in this time graph. we show that blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding over the past year, not just in metrics of scale, but also in metrics of community structure and connectedness. this expansion shows no sign of abating, although measures of connectedness must plateau within two years. by randomizing link destinations in blogspace, but retaining sources and timestamps, we introduce a concept of randomized blogspace . herein, we observe similar evolution of a giant component, but no corresponding increase in community structure. having demonstrated the formation of micro-communities over time, we then turn to the ongoing activity within active communities. we extend recent work of kleinberg to discover dense periods of "bursty" intra-community link creation. surrounding text:extracting comments enriches the model of the social network of bloggers and readers, and may be used for enhancing studies of weblog communities and the interactions between them, including the burstiness work by kumar et. al [***]<2> and the in-depth analysis of a specific community done by wei [20]<2>. finally, weblog comments are a source of search engine optimization spam influence:2 type:2 pair index:238 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:211 citee title:audience, structure and authority in the weblog community citee abstract:the weblog medium, while fundamentally an innovation in personal publishing has also come to engender a new form of social interaction on the web: a massively distributed but completely connected conversation covering every imaginable topic of interest. a byproduct of this ongoing communication is the set of hyperlinks made between weblogs in the exchange of dialog, a form of social acknowledgement on the part of authors. this paper seeks to understand the social implications of linking in the community, drawing from the hyperlink citations collected by the blogdex project over the past 3 years. social network analysis is employed to describe the resulting social structure, and two measures of authority are explored: popularity, as measured by webloggers�public affiliations and influence measured by citation of each others writing. these metrics are evaluated with respect to each other and with the authority conferred by references in the popular press. surrounding text:1. 
introduction weblog comments serve as “a simple and effective way for webloggers to interact with their readership�[***]<3>. they are one of the defining set of weblog characteristics [21]<3>, and most bloggers identify comment feedback as an important motivation for their writing [19, 4]<3> influence:2 type:3 pair index:239 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:212 citee title:blocking blog spam with language model disagreement citee abstract:we present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. in contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. preliminary experiments with identification of typical blog spam show promising results surrounding text:finally, weblog comments are a source of search engine optimization spam. this is discussed in [***]<0>. 2 influence:2 type:3 pair index:240 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:213 citee title:learning with skewed class distributions citee abstract:: several aspects may influence the performance achieved by a classifier created by a machine learning system. one of these aspects is related to the difference between the numbers of examples belonging to each class. when this difference is large, the learning system may have difficulties to learn the concept related to the minority class. in this work, we discuss several issues related to learning with skewed class distributions, such as the relationship between cost-sensitive learning and surrounding text:g. , [***]<3>) �a maximumlikelihood classifier would have achieved an overall f-score of 0. 84 by classifying all threads as non-disputative, but would have little meaning as a baseline as it would have yielded an f-score of 0 on the disputative comments only influence:3 type:3 pair index:241 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. 
this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:214 citee title:towards a robust metric of opinion citee abstract:this paper describes an automated system for detecting polar expressions about a topic of interest. the two elementary components of this approach are a shallow nlp polar language extraction system and a machine learning based topic classifier. these components are composed together by making a simple but accurate collocation assumption: if a topical sentence contains polar language, the system predicts that the polar language is reflective of the topic, and not some other subject matter. we evaluate our system, components and assumption on a corpus of online consumer messages. based on these components, we discuss how to measure the overall sentiment about a particular topic as expressed in online messages authored by many different people. we propose to use the fundamentals of bayesian statistics to form an aggregate authorial opinion metric. this metric would propagate uncertainties introduced by the polarity and topic modules to facilitate statistically valid comparisons of opinion across multiple topics. surrounding text:polarity. the sentiment analysis method described in [***]<1> was used to identify the orientation of the text of the comments. the intuition here is that disputes are more likely to have a negative tone than other types of discussion influence:3 type:1 pair index:242 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:2 citee title:a bayesian approach to filtering junk e-mail citee abstract:in addressing the growing problem of junk e-mail on the internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. by casting this problem in a decision theoretic framework, we are able to make use of probabilistic learning methods in conjunction with a notion of differential misclassification cost to produce filters which are especially appropriate for the nuances of this task. while this may appear, at first, to be a surrounding text:g. , [***]<3>). influence:3 type:3 pair index:243 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. 
this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:215 citee title:machine learning in automated text categorization citee abstract:the automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. in the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. the advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. this survey discusses the main approaches to text categorization that fall within the machine learning paradigm. we will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation. surrounding text:feature set . frequency counts - the basic and most popular feature set used in text classification tasks [***]<2>. we used counts of words and word bigrams in the comments, as well as counts of a manually constructed small list of longer phrases typicallly used in debates (“i don’t think that�“you are wrong� and so on) influence:3 type:3 pair index:244 citer id:207 citer title:an analysis of weblog comments citer abstract:access to weblogs, both through commercial services and in academic studies, is usually limited to the content of the weblog posts. this overlooks an important aspect distinguishing weblogs from other web pages: the ability of weblog readers to respond to posts directly, by posting comments. in this paper we present a large-scale study of weblog comments and their relation to the posts. using a sizable corpus of comments, we estimate the overall volume of comments in the blogosphere; analyze the relation between the weblog popularity and commenting patterns in it; and measure the contribution of comment content to various aspects of weblog access citee id:216 citee title:what makes a weblog a weblog? citee abstract:at berkman we're studying weblogs, how they're used, and what they are. rather than saying "i know it when i see it" i wanted to list all the known features of weblog software, but more important, get to the heart of what a weblog is, and how a weblog is different from a wiki, or a news site managed with software like vignette or interwoven. i draw from my experience developing and using weblog software (manila, radio userland) and using competitive products such as blogger and movable type. this piece is being published along with my keynotes at oscom and the jupiter weblogs conference. and a disclaimer: this is a work in progress. 
there may be subsequent versions as the art and market for weblog software develops surrounding text:introduction weblog comments serve as “a simple and effective way for webloggers to interact with their readership” [9]<3>. they are one of the defining set of weblog characteristics [***]<3>, and most bloggers identify comment feedback as an important motivation for their writing [19, 4]<3>. despite this, comments are largely ignored in current studies of large amounts of weblog data, typically because extracting and processing their content is somewhat more complex than extracting the content of the posts themselves influence:1 type:2 pair index:245 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:244 citee title:dbxplorer: a system for keyword based search over relational databases citee abstract:internet search engines have popularized the keyword-based search paradigm. while traditional database management systems offer powerful query languages, they do not allow keyword-based search. in this paper, we discuss dbxplorer, a system that enables keyword-based search in relational databases. dbxplorer has been implemented using a commercial relational database and web server and allows users to interact via a browser front-end. we outline the challenges and discuss the implementation of our system including results of extensive experimental evaluation. surrounding text:the papers [10, 23, 24, 32]<2> employ relevance-feedback techniques for learning similarity in multimedia and relational databases. a keyword-based retrieval system over databases is proposed in [***]<2>. the distinguishing aspects of our work from the above are (a) we address the challenges that a heterogeneous mix of numeric as well as categorical attributes pose, and (b) we propose a novel and easy to implement ranking method based on query workload analysis influence:1 type:2 pair index:246 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:243 citee title:automated ranking of database query results citee abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments surrounding text:1/5, where s is the standard deviation of {t1, t2, …, tn}. for theoretical justification of these extensions, see [***]<1>. 3.2), and for breaking ties among tuples (section 5). further details of these extensions to ita may be found in [***]<1>. when our ranking order is over the result of a relational query, defined over a set of tables, additional challenges arise.
in the equation, ri is the subject’s preference for the ith tuple in the ranked list returned by the ranking function (1 if it is marked relevant, and 0 otherwise). the intuition behind the r metric is that if relevant tuples are ranked low, they contribute less to the value of r with exponential decay (see [***]<1> for further discussion on the r metric). we next present the r metric values obtained in various quality experiments (r values are normalized by dividing by the maximum possible value for r) influence:1 type:1 pair index:247 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:245 citee title:modern information retrieval citee abstract:information retrieval (ir) has changed considerably in the last years with the expansion of the web (world wide web) and the advent of modern and inexpensive graphical user interfaces and mass storage devices. as a result, traditional ir textbooks have become quite out-of-date which has led to the introduction of new ir books recently. nevertheless, we believe that there is still great need of a book that approaches the field in a rigorous and complete way from a computer-science perspective (in opposition to a user-centered perspective). this book is an effort to partially fulfill this gap and should be useful for a first course on information retrieval as well as for a graduate course on the topic. these www pages are not a digital version of the book, nor the complete contents of it. here you will find the preface, table of contents, glossary and two chapters available for reading on-line. the printed version can be ordered directly from addison-wesley-longman. surrounding text:related work extracting ranking functions has been extensively investigated in areas outside database research such as information retrieval. the cosine similarity metric with tf-idf weighting of the vector space model [***]<1> is very successful in practice. we extend the tf-idf weighting technique for database ranking to handle a heterogeneous mix of numeric and categorical data. (in this formula we define qf(q) = (rqf(q)+1)/ (rqfmax+1) so that even if a value is never referenced in the workload, it gets a small non-zero qf). using multiplication to combine the two factors is inspired by the tf*idf factors in the original tf-idf ranking function [***]<1>. the resulting function noticeably improved ranking quality in certain cases (see section 7) influence:2 type:3 pair index:248 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. 
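The r metric described in the surrounding text above (binary relevance judgments, lower-ranked relevant tuples contributing exponentially less, results normalized by the maximum possible value) can be sketched as follows. The source does not reproduce the equation here, so the half-life-style decay below and the value h=5 are assumptions used only for illustration.

def r_metric(relevance, h=5):
    # relevance: 0/1 judgments in the order returned by the ranking function;
    # a relevant tuple at rank i contributes 1 / 2**(i / h), so low-ranked
    # relevant tuples contribute exponentially less. the half-life h is an
    # assumed illustration parameter, not a value taken from the source.
    return sum(r / 2 ** (i / h) for i, r in enumerate(relevance))

def normalized_r(relevance, h=5):
    # normalize by the maximum possible value of r (every rank relevant)
    max_r = r_metric([1] * len(relevance), h)
    return r_metric(relevance, h) / max_r if max_r else 0.0

print(normalized_r([1, 1, 0, 0, 0, 0]))  # relevant tuples ranked high: ~0.43
print(normalized_r([0, 0, 0, 0, 1, 1]))  # relevant tuples ranked low:  ~0.25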
we present results of preliminary experiments citee id:246 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation surrounding text:we extend the tf-idf weighting technique for database ranking to handle a heterogeneous mix of numeric and categorical data. ranking is an important component in collaborative filtering research [***]<1>. these methods require training data using queries as well as their ranked results influence:3 type:1 pair index:249 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:247 citee title:top-k selection queries over relational databases: mapping strategies and performance evaluation citee abstract:in many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. in this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that a traditional relational database management system (rdbms) can process efficiently. in particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to an rdbms, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme. we also report the first experimental evaluation of the mapping strategies over a real rdbms, namely over microsoft's sql server 7.0. the experiments show that our new techniques are robust and significantly more efficient than previously known strategies requiring at least one sequential scan of the data sets. surrounding text:a major concern of this paper is the query processing techniques for supporting ranking. several techniques have been previously developed in database research for the top-k problem [***, 7, 14, 15, 31]<1>. we adopt the algorithm in [15]<1> for our purposes, and discuss issues such as how the relational engine and indexes/materialized views can be leveraged for query performance influence:1 type:2 pair index:250 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval.
we present results of preliminary experiments citee id:248 citee title:evaluating top-k queries over web-accessible databases citee abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or “top” k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the user’s location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. surrounding text:a major concern of this paper is the query processing techniques for supporting ranking. several techniques have been previously developed in database research for the top-k problem [6, ***, 14, 15, 31]<1>. we adopt the algorithm in [15]<1> for our purposes, and discuss issues such as how the relational engine and indexes/materialized views can be leveraged for query performance. however, we observe that the overlap similarity function (in fact, all similarity functions discussed in this paper) satisfies a useful monotonic property: if t and u are two tuples such that for all k, sk(tk, qk) ≤ sk(uk, qk), then sim(t, q) ≤ sim(u, q). this enables us to adapt fagin’s threshold algorithm (ta) and its derivatives [***, 15]<1> to retrieve the top-k tuples without having to process all tuples of the database. to adapt ta for our purposes, we have to implement two types of access methods: (a) sorted access along any attribute ak, in which tids of tuples can be efficiently retrieved one-by-one in order of decreasing similarity of their ak attribute value from qk, and (b) random access, in which the entire tuple corresponding to any given tid can be efficiently retrieved. we leverage available database indexes such as b+ trees to efficiently implement these two access methods. since it is unrealistic to assume that indexes are always present on all attributes specified by any query, we adapt a derivative of ta [***]<1> that works even if sorted access is not available on some attributes. our resulting adaptation, called the index-based threshold algorithm, or ita, is shown in figure 2 influence:1 type:1 pair index:251 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval.
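The threshold-algorithm idea referred to in the surrounding text above (per-attribute sorted access plus random access, with a stopping threshold derived from a monotone aggregate) can be sketched compactly as follows. This is plain TA with an illustrative mean aggregate and an in-memory data layout, not the paper's index-based ITA variant (which additionally handles attributes without sorted access); all names and values are hypothetical.

import heapq

def threshold_topk(sorted_lists, tuple_scores, k):
    # sorted_lists: one list per attribute of (similarity, tid) pairs in
    #   decreasing similarity order (sorted access).
    # tuple_scores: tid -> per-attribute similarities (random access).
    # the aggregate here is the mean of per-attribute similarities, which is
    # monotone, so the usual TA stopping condition applies.
    seen, best = set(), []                    # best is a min-heap of (score, tid)
    for row in zip(*sorted_lists):            # one sorted-access step on every list
        for sim, tid in row:
            if tid not in seen:
                seen.add(tid)
                score = sum(tuple_scores[tid]) / len(tuple_scores[tid])
                heapq.heappush(best, (score, tid))
                if len(best) > k:
                    heapq.heappop(best)
        # no unseen tuple can beat the aggregate of the current row (monotonicity)
        threshold = sum(sim for sim, _ in row) / len(row)
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, reverse=True)

lists = [[(0.9, "t1"), (0.8, "t2"), (0.1, "t3")],   # similarity to the query on attribute a1
         [(0.9, "t2"), (0.5, "t3"), (0.4, "t1")]]   # similarity to the query on attribute a2
scores = {"t1": [0.9, 0.4], "t2": [0.8, 0.9], "t3": [0.1, 0.5]}
print(threshold_topk(lists, scores, k=1))           # [(0.85, 't2')], found before exhausting the lists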
we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:249 citee title:on saying "enough already!" in sql citee abstract:in this paper, we study a simple sql extension that enables query writers to explicitly limit the cardinality of a query result. we examine its impact on the query optimization and run-time execution components of a relational dbms, presenting two approaches - a conservative approach and an aggressive approach - to exploiting cardinality limits in relational query plans. results obtained from an empirical study conducted using db2 demonstrate the benefits of the sql extension and illustrate the tradeoffs between our two approaches to implementing it surrounding text:, sort the relation to get top-k results. recent papers have focused on how to efficiently implement a sort_topk operator [***, 9]<2>. it is important to note that the assumed semantics of top-k is nondeterministic, i influence:2 type:2 pair index:252 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:250 citee title:reducing the braking distance of an sql query engine citee abstract:in a recent paper, we proposed adding a stop after clause to sql to permit the cardinality of a query result to be explicitly limited by query writers and query tools. we demonstrated the usefulness of having this clause, showed how to extend a traditional cost-based query optimizer to accommodate it, and demonstrated via db2-based simulations that large performance gains are possible when stop after queries are explicitly supported by the database engine. in this paper, we present several new surrounding text:, sort the relation to get top-k results. recent papers have focused on how to efficiently implement a sort_topk operator [8, ***]<2>. it is important to note that the assumed semantics of top-k is nondeterministic, i influence:2 type:2 pair index:253 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:251 citee title:integration of heterogeneous databases without common domains using queries based on textual similarity citee abstract:most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. however, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, surrounding text:the early work of [21]<2> considered vague/imprecise similarity-based querying of databases.
the problem of integrating databases and information retrieval systems has been attempted in several works [***, 13, 17, 18]<2>. information retrieval based approaches have been extended to xml retrieval in [26]<2> influence:3 type:2 pair index:254 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:252 citee title:providing database-like access to the web using queries based on textual similarity citee abstract:most databases contain “name constants” like course numbers, personal names, and place names that correspond to entities in the real world. previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. here we assume instead that the names are given in natural language text. we then propose a logic for database integration called whirl which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. an implemented data integration system based on whirl has been used to successfully integrate information from several dozen web sites in two domains. surrounding text:the early work of [21]<2> considered vague/imprecise similarity-based querying of databases. the problem of integrating databases and information retrieval systems has been attempted in several works [12, ***, 17, 18]<2>. information retrieval based approaches have been extended to xml retrieval in [26]<2> influence:2 type:2 pair index:255 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:253 citee title:fuzzy queries in multimedia database systems citee abstract:there are essential differences between multimedia databases (which may contain complicated objects, such as images), and traditional databases. these differences lead to interesting new issues, and in particular cause us to consider new types of queries. for example, in a multimedia database it is reasonable and natural to ask for images that are somehow "similar to" some fixed image. furthermore, there are different ways of obtaining and accessing information in a multimedia database than surrounding text:a major concern of this paper is the query processing techniques for supporting ranking. several techniques have been previously developed in database research for the top-k problem [6, 7, ***, 15, 31]<1>. we adopt the algorithm in [15]<1> for our purposes, and discuss issues such as how the relational engine and indexes/materialized views can be leveraged for query performance influence:3 type:2 pair index:256 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval.
we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes. for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm ("fagin's algorithm" or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm ("the threshold algorithm" or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. surrounding text:a major concern of this paper is the query processing techniques for supporting ranking. several techniques have been previously developed in database research for the top-k problem [6, 7, 14, ***, 31]<1>. we adopt the algorithm in [***]<1> for our purposes, and discuss issues such as how the relational engine and indexes/materialized views can be leveraged for query performance. 3. however, we observe that the overlap similarity function (in fact, all similarity functions discussed in this paper) satisfies a useful monotonic property: if t and u are two tuples such that for all k, sk(tk, qk) ≤ sk(uk, qk), then sim(t, q) ≤ sim(u, q). this enables us to adapt fagin's threshold algorithm (ta) and its derivatives [7, ***]<1> to retrieve the top-k tuples without having to process all tuples of the database.
to adapt ta for our purposes, we have to implement two types of access methods: (a) sorted access along any attribute ak, in which tids of tuples can be efficiently retrieved one-by-one in order of decreasing similarity of their ak attribute value from qk, and (b) random access, in which the entire tuple corresponding to any given tid can be efficiently retrieved influence:1 type:1 pair index:257 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:255 citee title:fastmap: a fast algorithm for indexing, data mining and visualization of traditional and multimedia datasets citee abstract:a very promising idea for fast searching in traditional and multimedia databases is to map objects into points in k-d space, using k feature-extraction functions, provided by a domain expert . thus, we can subsequently use highly fine-tuned spatial access methods (sams), to answer several types of queries, including the `query by example' type (which translates to a range query surrounding text:g. fast-map [***]<3>). another possible approach is to use inverted lists, a popular data structure in information retrieval influence:2 type:2 pair index:258 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:153 citee title:a probabilistic framework for vague queries and imprecise information in databases citee abstract:a probabilistic learning model for vague queries and missing or imprecise information in databases is described. instead of retrieving only a set of answers, our approach yields a ranking of objects from the database in response to a query. by using relevance judgements from the user about the objects retrieved, the ranking for the actual query as well as the overall retrieval quality of the system can be further improved. for specifying different kinds of conditions in vague queries, the notion of vague predicates is introduced. based on the underlying probabilistic model, also imprecise or missing attribute values can be treated easily. in addition, the corresponding formulas can be applied in combination with standard predicates (from two-valued logic), thus extending standard database systems for coping with missing or imprecise data. surrounding text:the early work of [21]<2> considered vague/imprecise similarity-based querying of databases. the problem of integrating databases and information retrieval systems has been attempted in several works [12, 13, ***, 18]<2>. information retrieval based approaches have been extended to xml retrieval in [26]<2> influence:2 type:2 pair index:259 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. 
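the record above walks through the two access methods needed to adapt the threshold algorithm (ta): sorted access per attribute and random access by tid. the following is a minimal, illustrative python sketch of that ta-style scan under a monotone (summed) similarity; the in-memory table, the similarity functions and the helper names are hypothetical stand-ins, not the implementation of the cited systems.

import heapq

def top_k(table, query, sims, k):
    # table: {tid: {attr: value}}, query: {attr: value}
    # sims: {attr: function(value, query_value) -> similarity score}
    attrs = list(query)
    # "sorted access": for each attribute, tids ordered by decreasing similarity to the query
    sorted_lists = {
        a: sorted(table, key=lambda t: sims[a](table[t][a], query[a]), reverse=True)
        for a in attrs
    }
    def overall(tid):
        # monotone aggregation: a simple sum of per-attribute similarities
        return sum(sims[a](table[tid][a], query[a]) for a in attrs)
    best, seen = [], set()
    for depth in range(len(table)):
        threshold = 0.0
        for a in attrs:
            tid = sorted_lists[a][depth]
            threshold += sims[a](table[tid][a], query[a])
            if tid not in seen:               # "random access": fetch the whole tuple by tid
                seen.add(tid)
                heapq.heappush(best, (overall(tid), tid))
                if len(best) > k:
                    heapq.heappop(best)
        # stop once the current k-th best score cannot be beaten by any unseen tuple
        if len(best) == k and best[0][0] >= threshold:
            break
    return sorted(best, reverse=True)

# toy usage
table = {1: {"price": 10, "year": 2000}, 2: {"price": 12, "year": 1999}, 3: {"price": 30, "year": 1990}}
sims = {a: (lambda v, q: 1.0 / (1.0 + abs(v - q))) for a in ("price", "year")}
print(top_k(table, {"price": 11, "year": 2000}, sims, 2))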
we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:256 citee title:preference sql - design, implementation, experiences citee abstract:current search engines can hardly cope adequately with fuzzy predicates defined by complex preferences. the biggest problem of search engines implemented with standard sql is that sql does not directly understand the notion of preferences. preference sql extends sql by a preference model based on strict partial orders (presented in more detail in the companion paper), where preference queries behave like soft selection constraints. several built-in base preference types and the powerful pareto operator, combined with the adherence to declarative sql programming style, guarantees great programming productivity. the preference sql optimizer does an efficient re-writing into standard sql, including a high-level implementation of the skyline operator for pareto-optimal sets. this pre-processor approach enables a seamless application surrounding text:the paper [30]<2> proposes distance functions for heterogeneous data, but the emphasis is on classification applications. in [***, 20]<2>, the authors propose sql extensions in which users can specify soft constraints in the form of preferences. these extensions broaden the expressiveness of search criteria by a user, but do not relieve the user from the onus of having to specify suitable ranking functions influence:3 type:2 pair index:261 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval.
we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:257 citee title:foundations of preferences in database systems citee abstract:personalization of e-services poses new challenges to database technology, demanding a powerful and flexible modeling technique for complex preferences. preference queries have to be answered cooperatively by treating preferences as soft constraints, attempting a best possible match-making. we propose a strict partial order semantics for preferences, which closely matches people's intuition. a variety of natural and of sophisticated preferences are covered by this model. we show how to inductively construct complex preferences by means of various preference constructors. this model is the key to a new discipline called preference engineering and to a preference algebra. given the best-matches-only (bmo) query model we investigate how complex preference queries can be decomposed into simpler ones, preparing the ground for divide & conquer algorithms. standard sql and xpath can be extended seamlessly by such preferences (presented in detail in the companion paper ). we believe that this model is appropriate to extend database technology towards effective support of personalization. surrounding text:the paper [30]<2> proposes distance functions for heterogeneous data, but the emphasis is on classification applications. in [19, ***]<2>, the authors propose sql extensions in which users can specify soft constraints in the form of preferences. these extensions broaden the expressiveness of search criteria by a user, but do not relieve the user from the onus of having to specify suitable ranking functions influence:3 type:2 pair index:262 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:258 citee title:vague: a user interface to relational databases that permits vague queries citee abstract:a specific query establishes a rigid qualification and is concerned only with data that match it precisely. a vague query establishes a target qualification and is concerned also with data that are close to this target. most conventional database systems cannot handle vague queries directly, forcing their users to retry specific queries repeatedly with minor modifications until they match data that are satisfactory. this article describes a system called vague that can handle vague queries directly. the principal concept behind vague is its extension to the relational data model with data metrics, which are definitions of distances between values of the same domain. a problem with implementing data distances is that different users may have different interpretations for the notion of distance. vague incorporates several features that enable it to adapt itself to the individual views and priorities of its users. surrounding text:in database research, there has been some scattered work on the automatic extraction of similarity/ranking functions from a database. the early work of [***]<2> considered vague/imprecise similarity-based querying of databases. 
the problem of integrating databases and information retrieval systems has been attempted in several works [12, 13, 17, 18]<2> influence:2 type:2 pair index:263 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:259 citee title:experiences in mining aviation safety data citee abstract:the goal of data analysis in aviation safety is simple: improve safety. however, the path to this goal is hard to identify. what data mining methods are most applicable to this task? what data are available and how should they be analyzed? how do we focus on the most interesting results? our answers to these questions are based on a recent research project we completed. the encouraging news is that we found a number of aviation safety offices doing commendable work to collect and analyze safety-related data. but we also found a number of areas where data mining techniques could provide new tools that either perform analyses that were not considered before, or that can now be done more easily. currently, aviation safety offices collect and analyze the incident reports by a combination of manual and automated methods. data analysis is done by safety officers who are well familiar with the domain, but not with data mining methods. some aviation safety officers have tools to automate the database query and report generation process. however, the actual analysis is done by the officer with only fairly rudimentary tools to help extract the useful information from the data. our research project looked at the application of data mining techniques to aviation safety data to help aviation safety officers with their analysis task. this effort led to the creation of a tool called the “aviation safety data mining workbench� this paper describes the research effort, the workbench, the experience with data mining of aviation safety data, and lessons learned. surrounding text:the distinguishing aspects of our work from the above are (a) we address the challenges that a heterogeneous mix of numeric as well as categorical attributes pose, and (b) we propose a novel and easy to implement ranking method based on query workload analysis. although [***]<2> describes a ranking application for a mix of categorical and numeric data, the similarity function is not automatically derived but rather is based on domain knowledge of the application. the paper [30]<2> proposes distance functions for heterogeneous data, but the emphasis is on classification applications influence:3 type:2 pair index:264 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. 
we present results of preliminary experiments citee id:217 citee title:an approach to integrating query refinement in sql citee abstract:with the emergence of applications that require content-based similarity retrieval, techniques to support such a retrieval paradigm over database systems have emerged as a critical area of research. user subjectivity is an important aspect of such queries, i.e., which objects are relevant to the user and which are not depends on the perception of the user. query refinement is used to handle user subjectivity in similarity search systems. this paper explores how to enhance database systems with surrounding text:information retrieval based approaches have been extended to xml retrieval in [26]<2>. the papers [10, ***, 24, 32]<2> employ relevance-feedback techniques for learning similarity in multimedia and relational databases. a keyword-based retrieval system over databases is proposed in [1]<2> influence:2 type:2 pair index:264 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:260 citee title:content-based image retrieval with relevance feedback in mars citee abstract:technology advances in the areas of image processing (ip) and information retrieval (ir) have evolved separately for a long time. however, successful content-based image retrieval systems require the integration of the two. there is an urgent need to develop integration mechanisms to link the image retrieval model to text retrieval model, such that the well established text retrieval techniques can be utilized. approaches of converting image feature vectors (if domain) to weighted-term vectors (ir domain) are proposed in this paper. furthermore, the relevance feedback technique from the ir domain is used in content-based image retrieval to demonstrate the effectiveness of this conversion. experimental results show that the image retrieval precision increases considerably by using the proposed integration approach surrounding text:information retrieval based approaches have been extended to xml retrieval in [26]<2>. the papers [10, 23, ***, 32]<2> employ relevance-feedback techniques for learning similarity in multimedia and relational databases. a keyword-based retrieval system over databases is proposed in [1]<2> influence:3 type:2 pair index:265 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:261 citee title:density estimation citee abstract:this article. we introduce the classic nonparametric estimator, the histogram, and outline its theoretical properties as well as good practice. we demonstrate how to improve the histogram, leading to our discussion of popular kernel methods. we conclude with a bivariate example, a way of choosing smoothing parameters, and new directions that promise further improvements. why choose nonparametric over parametric density estimation?
parametric density estimation requires both proper specification surrounding text:whereas, if q also has the value t, then s(t, q) degenerates to log(n/nt), which is exactly the formula for categorical data. the above numerical extensions to idf have been derived using kernel density estimation techniques [***]<3>. a popular estimate for the bandwidth is h = 1 influence:3 type:3 pair index:267 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:262 citee title:the index-based xxl search engine for querying xml data with relevance ranking citee abstract:query languages for xml such as xpath or xquery support boolean retrieval: a query result is a (possibly restructured) subset of xml elements or entire documents that satisfy the search conditions of the query. this search paradigm works for highly schematic xml data collections such as electronic catalogs. however, for searching information in open environments such as the web or intranets of large corporations, ranked retrieval is more appropriate: a query result is a rank list of xml... surrounding text:the problem of integrating databases and information retrieval systems has been attempted in several works [12, 13, 17, 18]<2>. information retrieval based approaches have been extended to xml retrieval in [***]<2>. the papers [10, 23, 24, 32]<2> employ relevance-feedback techniques for learning similarity in multimedia and relational databases influence:3 type:3 pair index:268 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:263 citee title:improved heterogeneous distance functions citee abstract:instance-based learning techniques typically handle continuous and linear input values well, but often do not handle nominal input attributes appropriately. the value difference metric (vdm) was designed to find reasonable distance values between nominal attribute values, but it largely ignores continuous attributes, requiring discretization to map continuous values into nominal values. this paper proposes three new heterogeneous distance functions, called the heterogeneous value difference metric (hvdm), the interpolated value difference metric (ivdm), and the windowed value difference metric (wvdm). these new distance functions are designed to handle applications with nominal attributes, continuous attributes, or both. in experiments on 48 applications the new distance metrics achieve higher classification accuracy on average than three previous distance functions on those datasets that have both nominal and continuous attributes. surrounding text:although [22]<2> describes a ranking application for a mix of categorical and numeric data, the similarity function is not automatically derived but rather is based on domain knowledge of the application.
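the surrounding text above sketches how the categorical idf formula log(n/nt) is extended to numeric attributes with kernel density estimation and a bandwidth h (the exact bandwidth constant is truncated in the record and is not filled in here). a hedged python sketch of that kind of kernel-smoothed idf, assuming a gaussian kernel, is:

import math

def numeric_idf(t, values, h):
    # smooth "frequency" of value t: every database value contributes through a
    # gaussian kernel; as h shrinks this tends to the exact count nt, recovering
    # the categorical formula log(n / nt).
    n = len(values)
    smoothed_count = sum(math.exp(-0.5 * ((t - v) / h) ** 2) for v in values)
    return math.log(n / smoothed_count)

prices = [100.0, 102.0, 98.0, 250.0, 400.0]
print(numeric_idf(101.0, prices, h=10.0))   # common value -> lower idf
print(numeric_idf(400.0, prices, h=10.0))   # rare value   -> higher idf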
the paper [***]<2> proposes distance functions for heterogeneous data, but the emphasis is on classification applications. in [19, 20]<2>, the authors propose sql extensions in which users can specify soft constraints in the form of preferences influence:2 type:2 pair index:269 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:264 citee title:using fagin's algorithm for merging ranked results in multimedia middleware citee abstract:a distributed multimedia information system allows users to access data of different modalities, from different data sources, ranked by various combinations of criteria. in , fagin gives an algorithm for efficiently merging multiple ordered streams of ranked results, to form a new stream ordered by a combination of those ranks. in this paper, we describe the implementation of fagin's algorithm in an actual multimedia middleware system, including a novel, incremental version of the algorithm that supports dynamic exploration of data. we show that the algorithm would perform well as part of a single multimedia server, and can even be effective in the distributed environment (for a limited set of queries), but that the assumptions it makes about random access limit its applicability dramatically. our experience provides a better understanding of an important algorithm, and exposes an open problem for distributed multimedia information systems. surrounding text:a major concern of this paper is the query processing techniques for supporting ranking. several techniques have been previously developed in database research for the top-k problem [6, 7, 14, 15, ***]<1>. we adopt the algorithm in [15]<1> for our purposes, and discuss issues such as how the relational engine and indexes/materialized views can be leveraged for query performance influence:3 type:1 pair index:270 citer id:243 citer title:automated ranking of database query results citer abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments citee id:265 citee title:falcon: feedback adaptive loop for content-based retrieval citee abstract:several methods currently exist that can perform relatively simple queries driven by relevance feedback on large multimedia databases. however, all these methods work only for vector spaces; that is, they require that objects be represented as vectors within feature spaces. moreover, their implied query regions are typically convex. this research paper explains our solution. we propose a novel method that is designed to handle disjunctive queries within metric spaces. the user provides weights for positive examples; our system "learns" the implied concept and returns similar objects. our method differs from existing relevance-feedback methods that base themselves upon euclidean or mahalanobis metrics, as it facilitates learning even disjunctive, concave models within vector spaces, as well as arbitrary metric spaces.
in addition, our method is completely example-driven, and imposes no requirements upon the user for other aspects such as feature selection. our main contributions are two-fold. not only do we present a novel way to estimate the dissimilarity of an object to a set of desirable objects, but we support it with an algorithm that shows how to exploit metric indexing structures that support range queries to accelerate the search without incurring false dismissals. our empirical results demonstrate that our method converges rapidly to excellent precision/recall, while outperforming sequential scanning by up to 200%. surrounding text:information retrieval based approaches have been extended to xml retrieval in [26]<2>. the papers [10, 23, 24, ***]<2> employ relevance-feedback techniques for learning similarity in multimedia and relational databases. a keyword-based retrieval system over databases is proposed in [1]<2> influence:3 type:2 pair index:271 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:309 citee title:distributional clustering of words for text classification citee abstract:this paper applies distributional clustering (pereira et al. 1993) to document classification. the approach clusters words into groups based on the distribution of class labels associated with each word. thus, unlike some other unsupervised dimensionality-reduction techniques, such as latent semantic indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. experimental results obtained on three real-world data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracy, significantly better than latent semantic indexing (deerwester et al. 1990), class-based clustering (brown et al. 1992), feature selection by mutual information (yang and pederson 1997), or markov-blanket-based feature selection (koller and sahami 1996). we also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering. surrounding text:the underlying assumption is that words that typically appear together should be associated with similar concepts. word clustering has also been profitably used in the automatic classification of documents, see[***]<2>. more on word clustering may be found in [24]<2> influence:2 type:2 pair index:272 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems.
most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:310 citee title:hierarchical taxonomies using divisive partitioning citee abstract:we propose an unsupervised divisive partitioning algorithm for document data sets which enjoys many favorable properties. in particular, the algorithm shows excellent scalability to large data collections and produces high quality clusters which are competitive with other clustering methods. the algorithm yields information on the significant and distinctive words within each cluster, and these words can be inserted into the naturally occurring hierarchical structure produced by the algorithm. surrounding text:to show that our algorithm works well on small data sets, we also created subsets of classic3 with 30 and 150 documents respectively. our final data set is a collection of 2340 reuters news articles downloaded from yahoo in october 1997[***]<3>. the articles are from 6 categories: 142 from business, 1384 from entertainment, 494 from health, 114 from politics, 141 from sports and 60 news articles from technology influence:3 type:3 pair index:273 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem.
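the citer abstract repeated in these records describes the co-clustering recipe only at a high level: scale the word-document matrix by the square roots of its row and column sums, take the second left and right singular vectors, and read a joint bipartition of words and documents off them. a rough numpy sketch of that idea is below; splitting the rescaled singular vectors by sign is a simplification assumed here for illustration, not necessarily the thresholding used in the paper.

import numpy as np

def co_cluster(A):
    # A: words x documents count matrix (dense here only for simplicity)
    d1 = np.sqrt(A.sum(axis=1))            # word degrees
    d2 = np.sqrt(A.sum(axis=0))            # document degrees
    An = A / np.outer(d1, d2)              # D1^{-1/2} A D2^{-1/2}
    U, s, Vt = np.linalg.svd(An)
    u2 = U[:, 1] / d1                      # second left singular vector, rescaled
    v2 = Vt[1, :] / d2                     # second right singular vector, rescaled
    return u2 >= 0, v2 >= 0                # word-side and document-side bipartition

A = np.array([[3, 2, 0, 0],
              [2, 4, 0, 1],
              [0, 0, 5, 3],
              [0, 1, 2, 4]], dtype=float)
word_side, doc_side = co_cluster(A)
print(word_side, doc_side)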
we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:311 citee title:document categorization and query generation on the world wide web using webace citee abstract:we present webace, an agent for exploring and categorizing documents on the world wide web based on a user profile. the heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. the document categories are not given a priori. we present the overall architecture and describe surrounding text:graph-theoretic techniques have also been considered for clustering. many earlier hierarchical agglomerative clustering algorithms[9]<2> and some recent work[***, 23]<2> model the similarity between documents by a graph whose vertices correspond to documents and weighted edges or hyperedges give the similarity between vertices. however these methods are computationally prohibitive for large collections since the amount of work required just to form the graph is quadratic in the number of documents influence:1 type:2 pair index:274 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:10 citee title:a cluster-based approach to thesaurus construction citee abstract:the importance of a thesaurus in the successful operation of an information retrieval system is well recognized. yet techniques which support the automatic generation of thesauri remain largely undiscovered. this paper describes one approach to the automatic generation of global thesauri, based on the discrimination value model of salton, yang, and yu and on an appropriate clustering algorithm. this method has been implemented and applied to two document collections. preliminary results indicate that this method, which produces improvements in retrieval performance in excess of 10 and 15 percent in the test collections, is viable and worthy of continued investigation surrounding text:words may be clustered on the basis of the documents in which they co-occur. such clustering has been used in the automatic construction of a statistical thesaurus and in the enhancement of queries[***]<2>.
the underlying assumption is that words that typically appear together should be associated with similar concepts influence:3 type:2 pair index:275 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:312 citee title:scatter/gather: a cluster-based approach to browsing large document collections citee abstract:document clustering has not been well received as an informationretrieval toolfi objections to its use fall intotwo main categories: first, that clustering is too slow forlarge corpora (with running time often quadratic in thenumber of documents); and second, that clustering doesnot appreciably improve retrievalfiwe argue that these problems arise only when clusteringis used in an attempt to improve conventional searchtechniquesfi however, looking at clustering as an informationaccess surrounding text:existing document clustering methods include agglomerative clustering[25]<2>, the partitional k-means algorithm[7]<2>, projection based methods including lsa[21]<2>, self-organizing maps[18]<2> and multidimensional scaling[16]<2>. for computational efficiency required in on-line clustering, hybrid approaches have been considered such as in[***]<2>. graph-theoretic techniques have also been considered for clustering influence:1 type:2 pair index:276 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. 
we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:313 citee title:efficient clustering of very large document collections citee abstract:an invaluable portion of scientific data occurs naturally in text form. given a large unlabeled document collection, it is often helpful to organize this collection into clusters of related documents. by using a vector space model, text data can be treated as high-dimensional but sparse numerical data vectors. it is a contemporary challenge to efficiently preprocess and cluster very large document collections. in this paper we present a time and memory efficient technique for the entire clustering surrounding text:since we know the "true" class label for each document, the confusion matrix captures the goodness of document clustering. in addition, the measures of purity and entropy are easily derived from the confusion matrix[***]<2>. table 2 summarizes the results of applying algorithm bipartition to the medcran data set influence:1 type:2 pair index:277 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:314 citee title:concept decompositions for large sparse text data using clustering citee abstract:unstructured text documents are becoming increasingly common and available; mining such data sets represents a major contemporary challenge. using words as features, text documents are often represented as high-dimensional and sparse vectors - a few thousand dimensions and a sparsity of 95 to 99% is typical. in this paper, we study a certain spherical k-means algorithm for clustering such document vectors. the algorithm outputs k disjoint clusters each with a concept vector that is the centroid surrounding text:typically, a large number of words exist in even a moderately sized set of documents, for example, in one test case we use 4303 words in 3893 documents, however, each document generally contains only a small number of words and hence, a is typically very sparse with almost 99% of the matrix entries being zero. existing document clustering methods include agglomerative clustering[25]<2>, the partitional k-means algorithm[***]<2>, projection based methods including lsa[21]<2>, self-organizing maps[18]<2> and multidimensional scaling[16]<2>.
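the surrounding text above notes that purity and entropy are derived from the cluster-versus-class confusion matrix. one common way to compute them (the exact normalizations used in the cited experiments are an assumption here) is:

import math

def purity_and_entropy(C):
    # C[i][j] = number of documents of true class j assigned to cluster i
    total = sum(sum(row) for row in C)
    purity = sum(max(row) for row in C) / total
    entropy = 0.0
    for row in C:
        n_i = sum(row)
        h_i = -sum((c / n_i) * math.log(c / n_i, 2) for c in row if c > 0)
        entropy += (n_i / total) * h_i      # size-weighted per-cluster entropy
    return purity, entropy

confusion = [[90, 5, 5],
             [10, 80, 10],
             [0, 15, 85]]
print(purity_and_entropy(confusion))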
for computational efficiency required in on-line clustering, hybrid approaches have been considered such as in[5]<2> influence:1 type:2 pair index:278 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:315 citee title:lower bounds for the partitioning of graphs citee abstract:let a k-partition of a graph be a division of the vertices into k disjoint subsets containing m1 ≥ m2 ≥ . . . ≥ mk vertices. let ec be the number of edges whose two vertices belong to different subsets. let λ1 ≥ λ2 ≥ . . . ≥ λk be the k largest eigenvalues of a matrix, which is the sum of the adjacency matrix of the graph plus any diagonal matrix u such that the sum of all the elements of the sum matrix is zero. then a theorem is given that shows the effect of the maximum degree of any node being limited, and it is also shown that the right-hand side is a concave function of u. computational studies are made of the ratio of upper bound to lower bound for the two-partition of a number of random graphs having up to 100 nodes. surrounding text:3.1 spectral graph bipartitioning spectral graph partitioning is another effective heuristic that was introduced in the early 1970s[15, ***, 11]<1>, and popularized in 1990[19]<1>. spectral partitioning generally gives better global solutions than the kl or fm methods influence:3 type:1 pair index:279 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem.
we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:316 citee title:pattern classification citee abstract:pattern classification consists in assigning entities, described by featurevectors, to predefined groups of patternsfi when the statistical characteristicsof the problem under consideration are perfectly known, minimalerror probability can be achieved by means of the bayes decision rulefiin practice, however, a sub-optimal classifier has to be constructed fromtraining datafi several neural network approaches to this problem havebeen proposedfi nearest-neighbor models are based on assessing surrounding text:graph-theoretic techniques have also been considered for clustering. many earlier hierarchical agglomerative clustering algorithms[***]<2> and some recent work[3, 23]<2> model the similarity between documents by a graph whose vertices correspond to documents and weighted edges or hyperedges give the similarity between vertices. however these methods are computationally prohibitive for large collections since the amount of work required just to form the graph is quadratic in the number of documents influence:3 type:1 pair index:280 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:58 citee title:a linear time heuristic for improving network partitions citee abstract:the fiduccia-matteyses min-cut heuristic provides an efficient solution to theproblem of separating a network of vertices into 2 separate partitions in an effort tominimize the number of nets which contain nodes in each partition. the heuristic isdesigned to specifically handle large and complex networks which contain multi-terminaland also weighted cells. the worst case computation time of this heuristic increaseslinearly with the overall size of the network. additionally, in practice in can be seen thatthe number of iterations that is typically required for the cutest to converge to theminimum value is typically very small. the key factor in obtaining this linear-timebehavior is due to the fact that the algorithm moves one node at a time between partitionsin an attempt to reduce the current cut-set by the maximum possible value. we must alsonote that at times, the maximum cut-set reduction will be negative; however, at this pointwe proceed with the algorithm as this allows the opportunity escape from a local minima,should we currently be in one. upon the movement of a node, appropriate nodesconnected to the moved cell are updated to represent the move. 
the use of simple, yet efficient data structures allows us to avoid redundant searching for the best cell to be moved, and similarly also prevents excess and unnecessary updates to the neighbor cells that need to be updated. thus the data structures themselves assist in adding to the already efficient nature of the algorithm. the heuristic also contains a balance constraint which allows the user to retain control over the size of the created partitions. surrounding text:however it is well known that this problem is np-complete[12]<3>. but many effective heuristic methods exist, such as the kernighan-lin (kl)[17]<2> and the fiduccia-mattheyses (fm)[***]<2> algorithms. however, both the kl and fm algorithms search in the local vicinity of given initial partitionings and have a tendency to get stuck in local minima influence:3 type:1 pair index:281 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:317 citee title:computers and intractability: a guide to the theory of np-completeness citee abstract:it was the very first book on the theory of np-completeness and computational intractability. the book features an appendix providing a thorough compendium of np-complete problems (which was updated in later editions of the book). the book is now outdated in some respects as it does not cover more recent developments such as the pcp theorem. it is nevertheless still in print and is regarded as a classic: in a 2006 study, the citeseer search engine listed the book as the most cited reference in computer science literature. surrounding text:graph partitioning is an important problem and arises in various applications, such as circuit partitioning, telephone network design, load balancing in parallel computation, etc. however it is well known that this problem is np-complete[***]<3>. but many effective heuristic methods exist, such as the kernighan-lin (kl)[17]<2> and the fiduccia-mattheyses (fm)[10]<2> algorithms influence:3 type:3 pair index:282 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem.
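the two records above describe the kernighan-lin and fiduccia-mattheyses move-based heuristics. the toy python pass below only illustrates the underlying idea (repeatedly move the unlocked vertex with the largest cut reduction, subject to a crude balance check); the bucketed gain structures that make a real fm pass linear time, and the acceptance of negative-gain moves to escape local minima, are deliberately left out.

def greedy_pass(adj, side, min_size):
    # adj: {v: set of neighbours}, side: {v: 0 or 1}
    locked = set()
    for _ in range(len(adj)):
        best_v, best_gain = None, 0
        for v in adj:
            if v in locked:
                continue
            if sum(1 for u in side if side[u] == side[v]) <= min_size:
                continue                      # do not shrink a side below min_size
            ext = sum(1 for u in adj[v] if side[u] != side[v])
            gain = ext - (len(adj[v]) - ext)  # cut edges removed minus cut edges created
            if gain > best_gain:
                best_v, best_gain = v, gain
        if best_v is None:
            break
        side[best_v] ^= 1                     # move the vertex and lock it
        locked.add(best_v)
    return side

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(greedy_pass(adj, {0: 0, 1: 0, 2: 1, 3: 0, 4: 1, 5: 1}, min_size=2))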
to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:318 citee title:study of document representations: multidimensional scaling of indexing terms citee abstract:the investigation is part of a continuing project to find better means for studying and evaluating document representations (e.g., s, indexes). the present study focused on understanding the ways individuals interpret index terms, with the goal of using the findings in improving indexing systems. indexing systems use all-or-none bases for matching descriptors from documents with descriptors from information requests. humans, in judging the similarity of meaning between different descriptors, can respond in terms of degrees of similarity. there appears to be more discrimination power available in the human judgments than in the all-or-none matching arrangements. the question is, if data on human judgments of similarity between indexing terms were available to a machine searching system, would search accuracy be improved over present levels. conclusions were that incorporation of such similarity relations into mechanical index-term matching procedures would indeed increase search accuracy, and that the procedures used in the study can be adapted to such use surrounding text:typically, a large number of words exist in even a moderately sized set of documents, for example, in one test case we use 4303 words in 3893 documents, however, each document generally contains only a small number of words and hence, a is typically very sparse with almost 99% of the matrix entries being zero. existing document clustering methods include agglomerative clustering[25]<2>, the partitional k-means algorithm[7]<2>, projection based methods including lsa[21]<2>, self-organizing maps[18]<2> and multidimensional scaling[***]<2>. for computational efficiency required in on-line clustering, hybrid approaches have been considered such as in[5]<2> influence:3 type:2 pair index:283 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. 
we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:319 citee title:self-organizing maps citee abstract:self-organizing maps (soms) are a data visualization technique invented by professor teuvo kohonen which reduces the dimensions of data through the use of self-organizing neural networks. the problem that data visualization attempts to solve is that humans simply cannot visualize high dimensional data as is, so techniques are created to help us understand this high dimensional data. two other techniques for reducing the dimensions of data that have been presented in this course are n-land and multi-dimensional scaling. the way soms go about reducing dimensions is by producing a map of usually 1 or 2 dimensions which plots the similarities of the data by grouping similar data items together. so soms accomplish two things: they reduce dimensions and display similarities. just to give you an idea of what a som looks like, here is an example of a som which i constructed. as you can see, like colors are grouped together; for example, the greens are all in the upper left hand corner and the purples are all grouped around the lower right and right hand side. surrounding text:typically, a large number of words exist in even a moderately sized set of documents, for example, in one test case we use 4303 words in 3893 documents, however, each document generally contains only a small number of words and hence, a is typically very sparse with almost 99% of the matrix entries being zero. existing document clustering methods include agglomerative clustering[25]<2>, the partitional k-means algorithm[7]<2>, projection based methods including lsa[21]<2>, self-organizing maps[***]<2> and multidimensional scaling[16]<2>. for computational efficiency required in on-line clustering, hybrid approaches have been considered such as in[5]<2> influence:3 type:2 pair index:284 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem.
the laplacian eigenvectors of grid graphs can be computed from kronecker products involving the eigenvectors of path graphs, and these eigenvectors can be used to compute good separators in grid graphs. a heuristic algorithm is designed to compute a vertex separator in a general graph by first computing an edge separator in the graph from an eigenvector of the laplacian matrix, and then using a maximum matching in a subgraph to compute the vertex separator. results on the quality of the separators computed by the spectral algorithm are presented, and these are compared with separators obtained from automatic nested dissection and the kernighan-lin algorithm. finally, we report the time required to compute the laplacian eigenvector, and consider the accuracy with which the eigenvector must be computed to obtain good separators. the spectral algorithm has the advantage that it can be implemented on a medium size multiprocessor in a straightforward manner. surrounding text:3.1 spectral graph bipartitioning spectral graph partitioning is another effective heuristic that was introduced in the early 1970s[15, 8, 11]<1>, and popularized in 1990[***]<1>. spectral partitioning generally gives better global solutions than the kl or fm methods influence:3 type:1 pair index:285 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:320 citee title:projections for efficient document clustering citee abstract:clustering is increasing in importance, but linear- and even constant-time clustering algorithms are often too slow for real-time applications. a simple way to speed up clustering is to speed up the distance calculations at the heart of clustering routines. we study two techniques for improving the cost of distance calculations, lsi and truncation, and determine both how much these techniques speed up clustering and how much they affect the quality of the resulting clusters. we find that the surrounding text:typically, a large number of words exist in even a moderately sized set of documents, for example, in one test case we use 4303 words in 3893 documents, however, each document generally contains only a small number of words and hence, a is typically very sparse with almost 99% of the matrix entries being zero. existing document clustering methods include agglomerative clustering[25]<2>, the partitional k-means algorithm[7]<2>, projection based methods including lsa[***]<2>, self-organizing maps[18]<2> and multidimensional scaling[16]<2>.
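The spectral bipartitioning heuristic referenced in the surrounding text above (a Laplacian eigenvector method, popularized in 1990) has a short generic form. A minimal sketch, assuming a small dense symmetric similarity matrix and a connected graph; a real implementation would use sparse eigensolvers:

```python
import numpy as np

def fiedler_bipartition(W):
    """Split the vertices of a weighted graph using the Fiedler vector of its Laplacian."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvalues returned in ascending order
    fiedler = vecs[:, 1]                  # eigenvector of the second smallest eigenvalue
    return fiedler >= np.median(fiedler)  # median split keeps the two sides balanced
```

Splitting at zero instead of the median trades balance for a lower cut value; both are common heuristics.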
for computational efficiency required in on-line clustering, hybrid approaches have been considered such as in[5]<2> influence:2 type:2 pair index:286 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:321 citee title:normalized cuts and image segmentation citee abstract:we propose a novel approach for solving the perceptual grouping problem in vision. rather than focusing on local features and their consistencies in the image data, our approach aims at extracting the global impression of an image. we treat image segmentation as a graph partitioning problem and propose a novel global criterion, the normalized cut, for segmenting the graph. the normalized cut criterion measures both the total dissimilarity between the different groups as well as the total similarity within the groups. we show that an efficient computational technique based on a generalized eigenvalue problem can be used to optimize this criterion. we have applied this approach to segmenting static images and found results very encouraging surrounding text:, weight(i) = Σk eik. this leads to the normalized cut criterion that was used in [***]<2> for image segmentation. note that for this choice of vertex weights, the vertex weight matrix w equals the degree matrix d, and weight(vi) = cut(v1, v2) + within(vi) for i = 1, 2, where within(vi) is the sum of the weights of edges with both end-points in vi influence:3 type:2 pair index:287 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem.
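For reference, the normalized cut criterion sketched in the surrounding text of the preceding record can be written out in its standard form (e_ij is the edge weight between vertices i and j, and V1, V2 is the bipartition); this is the textbook formulation, added here for clarity:

```latex
\mathrm{cut}(V_1,V_2) = \sum_{i \in V_1,\, j \in V_2} e_{ij}, \qquad
\mathrm{weight}(V_\ell) = \sum_{i \in V_\ell} \sum_{k} e_{ik}, \qquad
\mathrm{Ncut}(V_1,V_2) = \frac{\mathrm{cut}(V_1,V_2)}{\mathrm{weight}(V_1)} + \frac{\mathrm{cut}(V_1,V_2)}{\mathrm{weight}(V_2)} .
```

With these vertex weights the weight matrix equals the degree matrix, and weight(V_i) = cut(V_1, V_2) + within(V_i), which is exactly the identity quoted above.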
we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:322 citee title:impact of similarity measures on web-page clustering citee abstract:clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. any clustering method has to embed the documents in a suitable similarity space. while several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different surrounding text:graph-theoretic techniques have also been considered for clustering. many earlier hierarchical agglomerative clustering algorithms[9]<2> and some recent work[3, ***]<2> model the similarity between documents by a graph whose vertices correspond to documents and weighted edges or hyperedges give the similarity between vertices. however these methods are computationally prohibitive for large collections since the amount of work required just to form the graph is quadratic in the number of documents influence:2 type:2 pair index:288 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously. in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:323 citee title:information retrieval citee abstract:information retrieval is a wide, often loosely-defined term but in these pages i shall be concerned only with automatic information retrieval systems. automatic as opposed to manual and information as opposed to data or fact. unfortunately the word information can be very misleading. in the context of information retrieval (ir), information, in the technical meaning given in shannon's theory of communication, is not readily measured (shannon and weaver1). in fact, in many cases one can surrounding text:word clustering has also been profitably used in the automatic classification of documents, see[1]<2>. more on word clustering may be found in [***]<2>. in this paper, we consider the problem of simultaneous or co-clustering of documents and words influence:2 type:3 pair index:289 citer id:308 citer title:co-clustering documents and words using bipartite spectral graph partitioning citer abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously.
in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice citee id:324 citee title:the effectiveness and efficiency of agglomerative hierarchic clustering in document retrieval citee abstract:an ad-hoc network is the cooperative engagement of a collection of mobile hosts without the required intervention of any centralized access point. in this paper we present an innovative design for the operation of such ad-hoc networks. the basic idea of the design is to operate each mobile host as a specialized router, which periodically advertises its view of the interconnection topology with other mobile hosts within the network. this amounts to a new sort of routing protocol. we have investigated modifications to the basic bellman-ford routing mechanisms, as specified by rip, to make it suitable for a dynamic and self-starting network mechanism as is required by users wishing to utilize ad hoc networks. our modifications address some of the previous objections to the use of bellman-ford, related to the poor looping properties of such algorithms in the face of broken links and the resulting time dependent nature of the interconnection topology describing the links between the mobile hosts. finally, we describe the ways in which the basic network-layer routing can be modified to provide mac-layer support for ad-hoc networks. surrounding text:typically, a large number of words exist in even a moderately sized set of documents, for example, in one test case we use 4303 words in 3893 documents, however, each document generally contains only a small number of words and hence, a is typically very sparse with almost 99% of the matrix entries being zero. existing document clustering methods include agglomerative clustering[***]<2>, the partitional k-means algorithm[7]<2>, projection based methods including lsa[21]<2>, self-organizing maps[18]<2> and multidimensional scaling[16]<2>. for computational efficiency required in on-line clustering, hybrid approaches have been considered such as in[5]<2> influence:2 type:2 pair index:290 citer id:332 citer title:combining language model with sentiment analysis for opinion citer abstract:this paper describes our participation in blog opinion retrieval task this year. we conduct experiments on “firtex” platform that is developed by our lab. language model is used to retrieve related blog unit. interactive knowledge is adopted to expand query for retrieve blog unit include opinion. then we introduce a novel extracting technology to extract text from retrieved blog-post. finally, lexicon based method is used to rerank the document by opinion citee id:125 citee title:a language modeling approach to information retrieval citee abstract:models of document indexing and document retrieval have been extensively studied.
the integration of these two classes of models has been the goal of several researchers but it is a very difficult problem. we argue that much of the reason for this is the lack of an adequate indexing model. this suggests that perhaps a better indexing model would help solve the problem. however, we feel that making unwarranted parametric assumptions will not lead to better retrieval performance. furthermore, making prior assumptions about the similarity of documents is not warranted either. instead, we propose an approach to retrieval based on probabilistic language modeling. we estimate models for each document individually. our approach to modeling is non-parametric and integrates document indexing and document retrieval into a single model. one advantage of our approach is that collection statistics which are used heuristically in many other retrieval models are an integral part of our model. we have implemented our model and tested it empirically. our approach significantly outperforms standard tf.idf weighting on two different collections and query sets. surrounding text:after filtering spam permalink data, we obtain about 80g permalink data. a statistical language model [***]<1> is used to retrieve related blog units. interactive knowledge is adopted to expand the query for retrieving blog units that include opinion influence:3 type:1 pair index:291 citer id:332 citer title:combining language model with sentiment analysis for opinion citer abstract:this paper describes our participation in blog opinion retrieval task this year. we conduct experiments on “firtex” platform that is developed by our lab. language model is used to retrieve related blog unit. interactive knowledge is adopted to expand query for retrieve blog unit include opinion. then we introduce a novel extracting technology to extract text from retrieved blog-post. finally, lexicon based method is used to rerank the document by opinion citee id:268 citee title:automatic query expansion using smart citee abstract:the smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. we continue our work in trec 3, performing runs in the routing, ad-hoc, and foreign language environments. our major focus is massive query expansion: adding from 300 to 530 terms to each query. these terms come from known relevant documents in the case of routing, and from just the top retrieved documents in the case of ad-hoc and spanish. surrounding text:3 query expansion using interactive knowledge query expansion is an effective method to bridge the semantic gap between the vocabulary of documents and that of users. there are three kinds of traditional query expansion approaches: lexicon based query expansion, global-analysis based query expansion [6]<3>, local-analysis based query expansion [***]<3>. lexicon based query expansion uses a lexicon, such as wordnet, to select terms semantically related to query terms influence:3 type:3 pair index:292 citer id:332 citer title:combining language model with sentiment analysis for opinion citer abstract:this paper describes our participation in blog opinion retrieval task this year. we conduct experiments on “firtex” platform that is developed by our lab. language model is used to retrieve related blog unit. interactive knowledge is adopted to expand query for retrieve blog unit include opinion. then we introduce a novel extracting technology to extract text from retrieved blog-post.
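The language-model retrieval step these records describe ranks each blog unit by the likelihood of the query under a smoothed unigram model of that document. A minimal sketch of such a scorer, not the authors' implementation: it assumes Dirichlet smoothing with a conventional prior mu=2000 and simply skips query terms never seen in the collection.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Score a document for a query with a Dirichlet-smoothed unigram language model."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for q in query_terms:
        p_coll = collection_tf.get(q, 0) / collection_len   # background model p(q|C)
        if p_coll == 0.0:
            continue                                        # out-of-vocabulary query term: skipped in this sketch
        p = (tf.get(q, 0) + mu * p_coll) / (dlen + mu)      # Dirichlet-smoothed p(q|d)
        score += math.log(p)
    return score                                            # higher log-likelihood = better match
```

Documents are then sorted by this score; the opinion-lexicon reranking mentioned in the citer abstract would be applied afterwards.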
finally, lexicon based method is used to rerank the document by opinion citee id:92 citee title:a hierarchical dirichlet language model. natural language engineering citee abstract:we discuss a hierarchical probabilistic model whose predictions are similar to those of the popular language modelling procedure known as `smoothing'. a number of interesting differences from smoothing emerge. the insights gained from a probabilistic view of this problem point towards new directions for language modelling. the ideas of this paper are also applicable to other problems such as the modelling of triphones in speech, and dna and protein sequences in molecular biology surrounding text:a direct solution to this problem is smoothing, which adjusts the mle so as to assign a nonzero probability to the unseen words and improve the accuracy of language model estimation. according to [***]<1>, there are three popular smoothing methods applied to the language modeling approaches to ad hoc ir: jelinek-mercer, dirichlet and absolute discounting, summarized in table 1. notice that du is the number of unique words in document d influence:3 type:1 pair index:293 citer id:349 citer title:constrained k-means clustering citer abstract:we consider practical methods for adding constraints to the k-means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points. we often observe this phenomenon when applying k-means to datasets where the number of dimensions is n ≥ 10 and the number of desired clusters is k ≥ 20. we propose explicitly adding k constraints to the underlying clustering optimization problem requiring that each cluster have at least a minimum number of points in it. we then investigate the resulting cluster assignment step. preliminary numerical tests on real datasets indicate the constrained approach is less prone to poor local solutions, producing a better summary of the underlying data. constrained k-means clustering citee id:350 citee title:density-based indexing for approximate nearest neighbor queries citee abstract:we consider the problem of performing nearest-neighbor queries efficiently over large high-dimensional databases. assuming that a full database scan to determine the nearest neighbor entries is not acceptable, we study the possibility of constructing an index structure over the database. it is well-accepted that traditional database indexing algorithms fail for high-dimensional data (say d ≥ 10 or 20 depending on the scheme). some arguments have advocated that nearest-neighbor queries
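The three smoothing schemes named in the surrounding text of the preceding record ("summarized in table 1" of the citing paper, which is not reproduced in this excerpt) have standard forms; they are written out below for reference, with c(w;d) the count of w in d, |d| the document length, d_u the number of unique words in d, and p(w|C) the collection language model. This is the conventional formulation, not a copy of the paper's table.

```latex
\text{Jelinek-Mercer:}\quad p_{\lambda}(w \mid d) = (1-\lambda)\,\frac{c(w;d)}{|d|} + \lambda\, p(w \mid C)
\text{Dirichlet:}\quad p_{\mu}(w \mid d) = \frac{c(w;d) + \mu\, p(w \mid C)}{|d| + \mu}
\text{Absolute discounting:}\quad p_{\delta}(w \mid d) = \frac{\max\big(c(w;d) - \delta,\, 0\big)}{|d|} + \frac{\delta\, d_u}{|d|}\, p(w \mid C)
```

The d_u term in the last formula is the reason the surrounding text needs to define the number of unique words in document d.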
we then investigate the resulting cluster assignment step. preliminary numerical tests on real datasets indicate the constrained approach is less prone to poor local solutions, producing a better summary of the underlying data. constrained k-means clustering citee id:352 citee title:nonlinear programming citee abstract:this is a substantially expanded (by 130 pages) and improved edition of our best-selling nonlinear programming book. the treatment focuses on iterative algorithms for constrained and unconstrained optimization, lagrange multipliers and duality, large scale problems, and on the interface between continuous and discrete optimization surrounding text:h fixed. the stationary point computed satisfies the karush-kuhn-tucker (kkt) conditions [***]<3> for problem (2), which are necessary for optimality. algorithm 2 influence:3 type:1 pair index:295 citer id:349 citer title:constrained k-means clustering citer abstract:we consider practical methods for adding constraints to the k-means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points.
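The "cluster assignment step" investigated in this constrained formulation must place every point while giving each cluster at least a minimum number of points; a later record in this section notes that bradley et al. suggested solving it as a minimum cost network flow problem. The sketch below is one way to set that flow problem up, under stated assumptions (networkx as the flow solver, squared euclidean costs rounded to integers, a single common lower bound min_size with k * min_size <= n); it is an illustration, not the paper's code.

```python
import numpy as np
import networkx as nx

def constrained_assignment(X, centers, min_size):
    """One assignment step with cluster-size lower bounds, posed as a min-cost flow."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    n, k = len(X), len(centers)
    cost = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # squared distances
    cost_int = np.rint(cost * 1000).astype(int)       # network simplex expects integer weights
    G = nx.DiGraph()
    for i in range(n):
        G.add_node(("pt", i), demand=-1)              # each point supplies one unit of flow
    for h in range(k):
        G.add_node(("cl", h), demand=min_size)        # each cluster must absorb at least min_size units
    G.add_node("sink", demand=n - k * min_size)       # the remaining flow drains into a sink
    for i in range(n):
        for h in range(k):
            G.add_edge(("pt", i), ("cl", h), capacity=1, weight=int(cost_int[i, h]))
    for h in range(k):
        G.add_edge(("cl", h), "sink", capacity=n, weight=0)
    flow = nx.min_cost_flow(G)                        # integral solution, since all data are integral
    return np.array([next(h for h in range(k) if flow[("pt", i)][("cl", h)] == 1)
                     for i in range(n)])
```

Centers are then recomputed from the new assignment and the two steps alternate, just as in ordinary k-means.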
the remaining portion of the paper is organized as follows influence:3 type:3 pair index:296 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:14 citee title:a condensation approach to privacy preserving data mining citee abstract:in recent years, privacy preserving data mining has become an important problem because of the large amount of personal data which is tracked by many business applications. in many cases, users are unwilling to provide personal information unless the privacy of sensitive information is guaranteed. in this paper, we propose a new framework for privacy preserving data mining of multi-dimensional data. previous work for privacy preserving data mining uses a perturbation approach which reconstructs data distributions in order to perform the mining. such an approach treats each dimension independently and therefore ignores the correlations between the different dimensions. in addition, it requires the development of a new distribution based algorithm for each data mining problem, since it does not use the multi-dimensional records, but uses aggregate distributions of the data as input. this leads to a fundamental re-design of data mining algorithms. in this paper, we will develop a new and flexible approach for privacy preserving data mining which does not require new problem-specific algorithms, since it maps the original data set into a new anonymized data set. this anonymized data closely matches the characteristics of the original data including the correlations among the different dimensions. we present empirical results illustrating the effectiveness of the method surrounding text:the k-anonymity model is defined on categorical data, and thus has different properties from our model which assumes a geometric space with the euclidean distance.
[***]<2> introduces a k-nearest neighbor based algorithm to solve the k-anonymity problem for numerical data. domingo-ferrer et al. figure 8: results for abalone dataset (only significance constraints are specified). two related approaches, the ppmicrocluster algorithm [22]<2> and the condensation group approach [***]<2>, solve a constrained clustering problem similar to the cdc problem. we chose the ppmicrocluster algorithm as our comparison partner due to the following reasons influence:3 type:2 pair index:297 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:240 citee title:approximation algorithms for k-anonymity citee abstract:we consider the problem of releasing a table containing personal records, while ensuring individual privacy and maintaining data integrity to the extent possible. one of the techniques proposed in the literature is k-anonymization. a release is considered k-anonymous if the information corresponding to any individual in the release cannot be distinguished from that of at least k - 1 other individuals whose information also appears in the release. in order to achieve k-anonymization, some of the entries of the table are either suppressed or generalized (e.g. an age value of 23 could be changed to the age range 20-25). the goal is to lose as little information as possible while ensuring that the release is k-anonymous. this optimization problem is referred to as the k-anonymity problem. we show that the k-anonymity problem is np-hard even when the attribute values are ternary and we are allowed only to suppress entries. on the positive side, we provide an o(k)-approximation algorithm for the problem.
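The k-anonymity requirement discussed in this record has a simple operational reading: every combination of quasi-identifier values in the released table must occur at least k times. A small illustrative checker of that definition (rows as dicts and an explicit list of quasi-identifier attributes are assumptions of this sketch, not part of the cited work):

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier combination appears in at least k rows."""
    combos = Counter(tuple(row[attr] for attr in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

# example: 2-anonymous with respect to (age_range, zip_prefix)
rows = [{"age_range": "20-25", "zip_prefix": "123"},
        {"age_range": "20-25", "zip_prefix": "123"}]
print(is_k_anonymous(rows, ["age_range", "zip_prefix"], k=2))   # True
```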
we also give improved positive results for the interesting cases with specific values of k; in particular, we give a 1.5-approximation algorithm for the special case of 2-anonymity, and a 2-approximation algorithm for 3-anonymity. surrounding text:1 other records in the table. [26]<2> and [***]<2> prove that k-anonymity with suppression is np-hard and study approximation algorithms. the k-anonymity model is defined on categorical data, and thus has different properties from our model which assumes a geometric space with the euclidean distance influence:3 type:2 pair index:298 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:355 citee title:max-min d-cluster formation in wireless ad hoc networks citee abstract:an ad hoc network may be logically represented as a set of clusters. the clusterheads form a d-hop dominating set. each node is at most d hops from a clusterhead. clusterheads form a virtual backbone and may be used to route packets for nodes in their cluster. previous heuristics restricted themselves to 1-hop clusters. we show that the minimum d-hop dominating set problem is np-complete. then we present a heuristic to form d-clusters in a wireless ad hoc network. nodes are assumed to have non-deterministic mobility pattern. clusters are formed by diffusing node identities along the wireless links. when the heuristic terminates, a node either becomes a clusterhead, or is at most d wireless hops away from its clusterhead. the value of d is a parameter of the heuristic. the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes.
this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evently distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions surrounding text:authors in [25]<2> present a clustering method on selforganizing sensor networks, for the purpose of grouping sensors into the optimal number of clusters that minimize the number of message transmissions. similar approaches on clustering in sensor networks also include [***, 8, 23]<2> etc. however, these clustering methods focus on dealing with engineering constraints instead of systematically studying the properties of the proposed clustering models influence:3 type:2 pair index:299 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:222 citee title:an energy-efficient hierarchical clustering algorithm for wireless sensor networks citee abstract:a wireless network consisting of a large number of small sensors with low-power transceivers can be an effective tool for gathering data in a variety of environments. the data collected by each sensor is communicated through the network to a single processing center that uses all reported data to determine characteristics of the environment or detect an event. the communication or message passing process must be designed to conserve the limited energy resources of the sensors. 
clustering sensors into groups, so that sensors communicate information only to clusterheads and then the clusterheads communicate the aggregated information to the processing center, may save energy. in this paper, we propose a distributed, randomized clustering algorithm to organize the sensors in a wireless sensor network into clusters. we then extend this algorithm to generate a hierarchy of clusterheads and observe that the energy savings increase with the number of levels in the hierarchy. results in stochastic geometry are used to derive solutions for the values of parameters of our algorithm that minimize the total energy spent in the network when all sensors report data through the clusterheads to the processing center. surrounding text:coyle et al. [***]<2> proposed a randomized algorithm to find the optimal number of cluster heads by minimizing the total energy spent on communicating between sensors and the information-processing center through the cluster heads. authors in [25]<2> present a clustering method on selforganizing sensor networks, for the purpose of grouping sensors into the optimal number of clusters that minimize the number of message transmissions influence:3 type:2 pair index:300 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:356 citee title:scalable clustering algorithms with balancing constraints citee abstract:clustering methods for data-mining problems must be extremely scalable. in addition, several data mining applications demand that the clusters obtained be balanced, i.e., of approximately the same size or importance. in this paper, we propose a general framework for scalable, balanced clustering. 
the data clustering process is broken down into three steps: sampling of a small representative subset of the points, clustering of the sampled data, and populating the initial clusters with the remaining data followed by refinements. first, we show that a simple uniform sampling from the original data is sufficient to get a representative subset with high probability. while the proposed framework allows a large class of algorithms to be used for clustering the sampled set, we focus on some popular parametric algorithms for ease of exposition. we then present algorithms to populate and refine the clusters. the algorithm for populating the clusters is based on a generalization of the stable marriage problem, whereas the refinement algorithm is a constrained iterative relocation scheme. the complexity of the overall method is o(kn log n) for obtaining k balanced clusters from n data points, which compares favorably with other existing techniques for balanced clustering. in addition to providing balancing guarantees, the clustering performance obtained using the proposed framework is comparable to and often better than the corresponding unconstrained solution. experimental results on several datasets, including high-dimensional (>20,000) ones, are provided to demonstrate the efficacy of the proposed framework. surrounding text:the generated clusters provide useful knowledge to support decision making in many applications. depending on perspective, clustering methods can be either data-driven or need-driven [***]<1>. the data-driven clustering methods intend to discover the true structure of the underlying data by grouping similar objects together while the need-driven clustering methods group objects based on not only similarity but also needs imposed by particular applications influence:1 type:2 pair index:301 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. 
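Read operationally, the CDC model sketched in this abstract accepts a grouping only if every cluster meets the user-provided constraints. The checker below illustrates that reading; treating both the significance (size) and variance constraints as lower bounds follows the "minimum ..." naming in the abstract and is an assumption of this sketch, as is the euclidean definition of variance.

```python
import numpy as np

def satisfies_cdc(clusters, min_size, min_variance):
    """clusters: iterable of (n_i, d) numpy arrays, one array of points per cluster."""
    for pts in clusters:
        if len(pts) < min_size:                                  # minimum significance constraint
            return False
        centroid = pts.mean(axis=0)
        variance = ((pts - centroid) ** 2).sum(axis=1).mean()    # mean squared distance to centroid
        if variance < min_variance:                              # minimum variance constraint (assumed lower bound)
            return False
    return True
```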
our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:11 citee title:a clustering scheme for hierarchical control in multi-hop wireless networks citee abstract:in this paper we present a clustering scheme to create a hierarchical control structure for multi-hop wireless networks. a cluster is defined as a subset of vertices, whose induced graph is connected. in addition, a cluster is required to obey certain constraints that are useful for management and scalability of the hierarchy. all these constraints cannot be met simultaneously for general graphs, but we show how such a clustering can be obtained for wireless network topologies. finally, we present an efficient distributed implementation of our clustering algorithm for a set of wireless nodes to create the set of desired clusters. surrounding text:authors in [25]<2> present a clustering method on selforganizing sensor networks, for the purpose of grouping sensors into the optimal number of clusters that minimize the number of message transmissions. similar approaches on clustering in sensor networks also include [4, ***, 23]<2> etc. however, these clustering methods focus on dealing with engineering constraints instead of systematically studying the properties of the proposed clustering models influence:3 type:2 pair index:302 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:349 citee title:constrained k-means clustering citee abstract:we consider practical methods for adding constraints to the k-means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points. 
we often observe this phenomenon when applying k-means to datasets where the number of dimensions is n ≥ 10 and the number of desired clusters is k ≥ 20. we propose explicitly adding k constraints to the underlying clustering optimization problem requiring that each cluster have at least a minimum number of points in it. we then investigate the resulting cluster assignment step. preliminary numerical tests on real datasets indicate the constrained approach is less prone to poor local solutions, producing a better summary of the underlying data. constrained k-means clustering surrounding text:for example, in market segmentation, relatively balanced customer groups are preferable so that the knowledge extracted from each group has equal significance and is thus easier to evaluate [19]<2>. the special requirement of identifying balanced clusters can be effectively captured by imposing balancing constraints [***, 31, 7, 35]<2>. these models enable us to find useful clusters. cluster-level constraints. the research on clustering with constraints was introduced by [***]<2> and systematically studied in [31]<2>. both clustering models aim at partitioning data points into k clusters while each cluster satisfies a significance constraint. bradley et al. [***]<2> proposed a constrained k-means algorithm and suggested to achieve a cluster assignment by solving a minimum cost network flow problem. tung et al
categories and subject descriptors citee id:357 citee title:localized minimum-energy broadcasting in ad-hoc networks citee abstract:in the minimum energy broadcasting problem, each node can adjust its transmission power in order to minimize total energy consumption but still enable a message originating from a source node to reach all the other nodes in an ad-hoc wireless network. in all existing solutions each node requires global network information (including distances between any two neighboring nodes in the network) in order to decide its own transmission radius. in this paper, we describe a new localized protocol where surrounding text:to prolong the lifetime of a sensor network, evenly distributing energy consumption among clusters is desired. since the energy consumption of message transmissions increases quadratically with the distance between communicating sensors [***]<3>, the variance of a group of sensors corresponds to the amount of energy consumed by those sensors on average. the minimum variance constraint allows grouping sensors into clusters which are balanced in terms of energy consumption influence:3 type:3 pair index:304 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm.
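The link the surrounding text draws between cluster variance and transmission energy can be made explicit. If transmitting over distance d costs energy proportional to d^2, then for a cluster C with centroid mu_C, the average energy its members spend reporting to a head placed at the centroid is proportional to the cluster's variance; this is a standard identity, written out here for illustration:

```latex
\mu_C = \frac{1}{|C|} \sum_{x_i \in C} x_i, \qquad
\mathrm{Var}(C) = \frac{1}{|C|} \sum_{x_i \in C} \lVert x_i - \mu_C \rVert^2, \qquad
\bar{E}(C) \;\propto\; \frac{1}{|C|} \sum_{x_i \in C} \lVert x_i - \mu_C \rVert^2 = \mathrm{Var}(C).
```

So constraining the variance of each cluster constrains, up to a constant factor, the average per-sensor energy of that cluster, which is how the variance constraint captures balanced energy consumption.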
categories and subject descriptors citee id:305 citee title:clustering with constraints: feasibility issues and the k-means algorithm citee abstract:recent work has looked at extending the k-means algorithm to incorporate background information in the form of instance-level must-link and cannot-link constraints. we introduce two ways of specifying additional background information in the form of δ and ε constraints that operate on all instances but which can be interpreted as conjunctions or disjunctions of instance-level constraints and hence are easy to implement. we present complexity results for the feasibility of clustering under each surrounding text:therefore, it is worthwhile to explore variants of the cdc model that allow the user to specify ranges instead of a fixed minimum value. finally, we believe that the cdc framework and the cd-tree algorithm can be generalized to include other constraint types, such as minimum separation constraints [***]<2>, to produce actionable clusters in an even broader category of applications. influence:2 type:2 pair index:305 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:358 citee title:identifying and generating easy sets of constraints for clustering citee abstract:clustering under constraints is a recent innovation in the artificial intelligence community that has yielded significant practical benefit.
however, recent work has shown that for some negative forms of constraints the associated sub-problem of just finding a feasible clustering is np-complete. these worst-case results for the entire problem class say nothing of where and how prevalent easy problem instances are. in this work, we show that there are large pockets within these problem classes where clustering under constraints is easy and that using easy sets of constraints yields better empirical results. we then illustrate several sufficient conditions from graph theory to identify a priori where these easy problem instances are and present algorithms to create large and easy-to-satisfy constraint sets. surrounding text:for example, specifying all clusters must have all their points more than δ distance apart can be achieved by must-linking all those points less than or equal to δ distance apart [13]<2>. specifying too many constraints is also problematic as algorithms that attempt to satisfy all constraints can quickly be overconstrained [***]<2> so that efficient algorithms to find just a single solution cannot be found even though they exist. clustering methods in sensor networks influence:2 type:2 pair index:306 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:359 citee title:the complexity of non-hierarchical clustering with constraints citee abstract:recent work has looked at extending clustering algorithms with instance-level must-link (ml) and cannot-link (cl) background information. our work introduces δ and ε cluster-level constraints that influence inter-cluster distances and cluster composition.
the addition of background information, though useful at providing better clustering results, raises the important feasibility question: given a collection of constraints and a set of data, does there exist at least one partition of the data set satisfying all the constraints? we study the complexity of the feasibility problem for each of the above constraints separately and also for combinations of constraints. our results clearly delineate combinations of constraints for which the feasibility problem is computationally intractable (i.e., np-complete) from those for which the problem is efficiently solvable (i.e., in the computational class p). we also consider the ml and cl constraints in conjunctive and disjunctive normal forms (cnf and dnf respectively). we show that for ml constraints, the feasibility problem is intractable for cnf but efficiently solvable for dnf. unfortunately, for cl constraints, the feasibility problem is intractable for both cnf and dnf. this effectively means that cl-constraints in a non-trivial form cannot be efficiently incorporated into clustering algorithms. to overcome this, we introduce the notion of a choice-set of constraints and prove that the feasibility problem for choice-sets is efficiently solvable for both ml and cl constraints. we also present empirical results which indicate that the feasibility problem occurs extensively in real world problems. surrounding text:though it is possible to specify cluster level constraints as instance level constraints this would require a large number of constraints for even moderately sized data sets. for example, specifying all clusters must have all their points more than δ distance apart can be achieved by must-linking all those points less than or equal to δ distance apart [***]<2>. specifying too many constraints is also problematic as algorithms that attempt to satisfy all constraints can quickly be overconstrained [12]<2> so that efficient algorithms to find just a single solution cannot be found even though they exist. [33]<3> proposed a linear time algorithm to compute such a layout. a similar rectilinear layout is used in [***]<2> to prove the np-completeness of the feasibility problem for the must-link and  constraints. 3 influence:2 type:2 pair index:307 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. 
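a cheap necessary check for the feasibility questions discussed above is to close the must-link constraints transitively with a union-find structure and then test every cannot-link pair against the resulting components: a cannot-link pair falling inside one must-link component is infeasible for any number of clusters. this is only a sketch of such a pre-check, not the feasibility algorithms of the cited work:

```python
def ml_cl_feasible(n, must_link, cannot_link):
    """Necessary feasibility check: returns False if any cannot-link pair
    ends up in the same must-link component (then no clustering exists)."""
    parent = list(range(n))

    def find(x):                      # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in must_link:            # transitive closure of the must-links
        union(a, b)
    return all(find(a) != find(b) for a, b in cannot_link)

# example: 0, 1, 2 must be together, but 0 and 2 cannot be -> infeasible (False)
print(ml_cl_feasible(4, must_link=[(0, 1), (1, 2)], cannot_link=[(0, 2)]))
```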
we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:360 citee title:practical data-oriented microaggregation for statistical disclosure control citee abstract:microaggregation is a statistical disclosure control technique for microdata disseminated in statistical databases. raw microdata (i.e., individual records or data vectors) are grouped into small aggregates prior to publication. each aggregate should contain at least k data vectors to prevent disclosure of individual information, where k is a constant value preset by the data protector. no exact polynomial algorithms are known to date to microaggregate optimally, i.e., with minimal variability loss. methods in the literature rank data and partition them into groups of fixed-size; in the multivariate case, ranking is performed by projecting data vectors onto a single axis. in this paper, candidate optimal solutions to the multivariate and univariate microaggregation problems are characterized. in the univariate case, two heuristics based on hierarchical clustering and genetic algorithms are introduced which are data-oriented in that they try to preserve natural data aggregates. in the multivariate case, fixed-size and hierarchical clustering microaggregation algorithms are presented which do not require data to be projected onto a single dimension; such methods clearly reduce variability loss as compared to conventional multivariate microaggregation on projected data surrounding text:domingo-ferrer et al. [***]<2> study the optimal k-partition problem which can be considered as the k-anonymity model in the euclidean space. and it is a special case of our model where only significance constraints are allowed. and it is a special case of our model where only significance constraints are allowed. but in [***]<2> the complexity of the proposed problem is not analyzed. [17]<2> considers privacy preservation as a problem of finding the minimum number of hyperspheres with a fixed radius to cover a dataset satisfying that each hypersphere covers at least a certain number of data objects. thus, we could obtain a clustering whose objective value is smaller than or equal to the previous one by splitting c into c1 and c2, yielding a contradiction. note that a similar result is presented in [***]<2>. in the cd-tree, each entry of a leaf node represents an individual data point influence:1 type:2 pair index:308 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation.
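the fixed-size multivariate microaggregation idea above (groups of at least k records, each record released as its group centroid) can be sketched with a simple data-oriented heuristic: repeatedly seed a group with the record farthest from the current centroid and add its k-1 nearest unassigned records. this is a hedged simplification in the spirit of the fixed-size methods, not the exact algorithms of the cited paper:

```python
import numpy as np

def fixed_size_microaggregate(X, k):
    """Partition rows of X into groups of >= k records (assumes len(X) >= k) and
    return the anonymized data where each record is replaced by its group centroid."""
    X = np.asarray(X, dtype=float)
    unassigned = list(range(len(X)))
    groups = []
    while len(unassigned) >= 2 * k:
        centroid = X[unassigned].mean(axis=0)
        # the record farthest from the running centroid seeds the next group
        seed = max(unassigned, key=lambda i: np.sum((X[i] - centroid) ** 2))
        dists = {i: np.sum((X[i] - X[seed]) ** 2) for i in unassigned}
        group = sorted(unassigned, key=dists.get)[:k]   # seed plus its k-1 nearest
        groups.append(group)
        unassigned = [i for i in unassigned if i not in group]
    groups.append(unassigned)            # the remaining k..2k-1 records form one group

    out = X.copy()
    for g in groups:
        out[g] = X[g].mean(axis=0)       # release centroids instead of raw records
    return out, groups
```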
however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:129 citee title:a microeconomic data mining problem: customer-oriented catalog segmentation citee abstract:the microeconomic framework for data mining assumes that an enterprise chooses a decision maximizing the overall utility over all customers where the contribution of a customer is a function of the data available on that customer. in catalog segmentation, the enterprise wants to design k product catalogs of size r that maximize the overall number of catalog products purchased. however, there are many applications where a customer, once attracted to an enterprise, would purchase more products beyond the ones contained in the catalog. therefore, in this paper, we investigate an alternative problem formulation, that we call customer-oriented catalog segmentation, where the overall utility is measured by the number of customers that have at least a specified minimum interest t in the catalogs. we formally introduce the customer-oriented catalog segmentation problem and discuss its complexity. then we investigate two different paradigms to design efficient, approximate algorithms for the customer-oriented catalog segmentation problem, greedy (deterministic) and randomized algorithms. since greedy algorithms may be trapped in a local optimum and randomized algorithms crucially depend on a reasonable initial solution, we explore a combination of these two paradigms. our experimental evaluation on synthetic and real data demonstrates that the new algorithms yield catalogs of significantly higher utility compared to classical catalog segmentation algorithms surrounding text:ester et al. [***]<1> extend the catalog segmentation problem [24]<1> by introducing a new utility measured by the number of customers that have at least a specified minimum interest in the catalogs. a joint optimization approach is proposed in [21]<1> to address two issues in market segmentation, i influence:2 type:2 pair index:309 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. 
constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:32 citee title:a disc-based approach to data summarization and privacy preservation citee abstract:data summarization has been recognized as a fundamental operation in database systems and data mining with important applications such as data compression and privacy preservation. while the existing methods such as cfvalues and databubbles may perform reasonably well, they cannot provide any guarantees on the quality of their results. in this paper, we introduce a summarization approach for numerical data based on discs formalizing the notion of quality. our objective is to find a minimal set of discs, i.e. spheres satisfying a radius and a significance constraint, covering the given dataset. since the proposed problem is np-complete, we design two different approximation algorithms. these algorithms have a quality guarantee, but they do not scale well to large databases. however, the machinery from approximation algorithms allows a precise characterization of a further, heuristic algorithm. this heuristic, efficient algorithm exploits multi-dimensional index structures and can be well-integrated with database systems. the experiments show that our heuristic algorithm generates summaries that outperform the state-of-the-art data bubbles in terms of internal measures as well as in terms of external measures when using the data summaries as input for clustering methods. surrounding text:but in [14]<2> the complexity of the proposed problem is not analyzed. [***]<2> considers privacy preservation as a problem of finding the minimum number of hyperspheres with a fixed radius to cover a dataset satisfying that each hypersphere covers at least a certain number of data objects. a similar model, named ppmicrocluster, is studied in [22]<2> which requires both significance and radius constraints influence:2 type:2 pair index:310 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. 
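the disc-based summarization problem mentioned above (cover the data with spheres of a fixed radius, each containing at least a minimum number of points) can be illustrated by a naive greedy pass that always picks the disc covering the most still-uncovered points; the names `r` and `min_pts` are illustrative, and this sketch offers no quality guarantee, unlike the approximation algorithms of the cited work:

```python
import numpy as np

def greedy_disc_cover(X, r, min_pts):
    """Greedily pick disc centers (taken from the data points) of radius r that each
    cover at least min_pts still-uncovered points; returns a list of (center, members)."""
    X = np.asarray(X, dtype=float)
    uncovered = np.ones(len(X), dtype=bool)
    discs = []
    while uncovered.any():
        best_i, best_members = None, None
        for i in np.flatnonzero(uncovered):
            d2 = np.sum((X - X[i]) ** 2, axis=1)
            members = np.flatnonzero(uncovered & (d2 <= r * r))
            if best_members is None or len(members) > len(best_members):
                best_i, best_members = i, members
        if len(best_members) < min_pts:
            break            # leftover points cannot satisfy the significance constraint
        discs.append((X[best_i], best_members))
        uncovered[best_members] = False
    return discs
```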
data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:361 citee title:optimal energy aware clustering in sensor network citee abstract:sensor networks is among the fastest growing technologies that have the potential of changing our lives drastically. these collaborative, dynamic and distributed computing and communicating systems will be self organizing. they will have capabilities of distributing a task among themselves for efficient computation. there are many challenges in implementation of such systems: energy dissipation and clustering being one of them. in order to maintain a certain degree of service quality and a reasonable system lifetime, energy needs to be optimized at every stage of system operation. sensor node clustering is another very important optimization problem. nodes that are clustered together will easily be able to communicate with each other. considering energy as an optimization parameter while clustering is imperative. in this paper we study the theoretical aspects of the clustering problem in sensor networks with application to energy optimization. we illustrate an optimal algorithm for clustering the sensor nodes such that each cluster (which has a master) is balanced and the total distance between sensor nodes and master nodes is minimized. balancing the clusters is needed for evenly distributing the load on all master nodes. minimizing the total distance helps in reducing the communication overhead and hence the energy dissipation. this problem (which we call balanced k-clustering) is modeled as a mincost flow problem which can be solved optimally using existing techniques. surrounding text:to motivate constraint-driven clustering with minimum significance and variance constraints, we use applications in energy aware sensor networks and privacy preservation as our running examples. 
energy aware sensor networks [***, 20, 7]<3>: grouping sensors into clusters is an important problem in sensor networks since it can drastically affect the network’s communication energy consumption [***]<3>. normally, a master node is chosen from sensors in each cluster or deployed to the central area of each cluster. to motivate constraint-driven clustering with minimum significance and variance constraints, we use applications in energy aware sensor networks and privacy preservation as our running examples. energy aware sensor networks [***, 20, 7]<3>: grouping sensors into clusters is an important problem in sensor networks since it can drastically affect the network’s communication energy consumption [***]<3>. normally, a master node is chosen from sensors in each cluster or deployed to the central area of each cluster influence:2 type:2 pair index:311 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:300 citee title:clustering and visualization of retail market baskets citee abstract:transaction analysis, including clustering of market baskets, is a key application of data mining to the retail industry. this domain has some specific requirements, such as the need for obtaining easily interpretable and actionable results. it also exhibits some very challenging characteristics, mostly stemming from the fact that the data have thousands of features and are highly non-gaussian and sparse. this chapter proposes a relationship-based approach to clustering such data that tries to sidestep the “curse-of-dimensionality�issue by working in a suitable similarity space instead of the original high-dimensional feature space. this intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. 
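following the sensor-network discussion above, where transmission energy grows roughly quadratically with the distance to the master node, the mean squared distance of a cluster's sensors to their master is a natural per-sensor energy proxy, and the cdc-style constraints can be checked cluster by cluster. a minimal sketch with illustrative thresholds `min_size` and `min_variance`:

```python
import numpy as np

def check_sensor_clusters(X, labels, min_size, min_variance):
    """For each cluster, report its size, its variance (mean squared distance to the
    cluster centroid, used as the master position) and whether it satisfies the
    minimum significance and minimum variance constraints."""
    report = {}
    for c in np.unique(labels):
        members = X[labels == c]
        master = members.mean(axis=0)              # master placed at the centroid
        sq_dist = np.sum((members - master) ** 2, axis=1)
        variance = sq_dist.mean()                  # per-sensor energy proxy
        report[int(c)] = {
            "size": len(members),
            "variance": float(variance),
            "ok": len(members) >= min_size and variance >= min_variance,
        }
    return report
```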
we apply efficient and scalable graph-partitioning-based clustering techniques in this space. the output from the clustering algorithm is used to reorder the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. the visualization is very helpful for assessing and improving clustering. for example, actionable recommendations for splitting or merging clusters can be easily derived, and it also guides the user toward a suitable number of clusters. results are presented on a real retail industry data set of several thousand customers and products surrounding text:alternatively, complex application needs can be captured by corresponding constraints. for example, in market segmentation, relatively balanced customer groups are more preferable so that the knowledge extracted from each group has equal significance and are thus easier to evaluate [***]<2>. the special requirement of identifying balanced clusters can be effectively captured by imposing balancing constraints [9, 31, 7, 35]<2> influence:3 type:2 pair index:312 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:362 citee title:joint optimization of customer segmentation and marketing policy to maximize long-term profitability citee abstract:with the advent of one-to-one marketing media, e.g. targeted direct mail or internet marketing, the opportunities to develop targeted marketing campaigns are enhanced in such a way that it is now both organizationally and economically feasible to profitably support a substantially larger number of marketing segments. however, the problem of what segments to distinguish, and what actions to take towards the different segments increases substantially in such an environment. 
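the graph-partitioning approach above works in a similarity space rather than in the raw, high-dimensional feature space. the cited system relies on metis; as a hedged stand-in, the sketch below builds a cosine-similarity graph from a customer-product matrix and produces one (near-)balanced two-way split by thresholding the fiedler vector of the graph laplacian at its median:

```python
import numpy as np

def balanced_bisection(customer_product):
    """Split customers into two (near-)equal-size groups using the Fiedler vector of
    the cosine-similarity graph (an illustrative stand-in for METIS-style partitioning)."""
    A = np.asarray(customer_product, dtype=float)
    norms = np.linalg.norm(A, axis=1, keepdims=True) + 1e-12
    S = (A / norms) @ (A / norms).T              # cosine similarity between customers
    np.fill_diagonal(S, 0.0)
    L = np.diag(S.sum(axis=1)) - S               # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    fiedler = vecs[:, 1]                         # eigenvector of the 2nd smallest eigenvalue
    return (fiedler > np.median(fiedler)).astype(int)   # median split => balanced sizes

# usage: labels = balanced_bisection(baskets)   # baskets: customers x products matrix
```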
a systematic analytic procedure optimizing both steps would be very welcome.in this study, we present a joint optimization approach addressing two issues: (1) the segmentation of customers into homogeneous groups of customers, (2) determining the optimal policy (i.e., what action to take from a set of available actions) towards each segment. we implement this joint optimization framework in a direct-mail setting for a charitable organization. many previous studies in this area highlighted the importance of the following variables: r(ecency), f(requency), and m(onetary value). we use these variables to segment customers. in a second step, we determine which marketing policy is optimal using markov decision processes, following similar previous applications. the attractiveness of this stochastic dynamic programming procedure is based on the long-run maximization of expected average profit. our contribution lies in the combination of both steps into one optimization framework to obtain an optimal allocation of marketing expenditures. moreover, we control segment stability and policy performance by a bootstrap procedure. our framework is illustrated by a real-life application. the results show that the proposed model outperforms a chaid segmentation. surrounding text:[16]<1> extend the catalog segmentation problem [24]<1> by introducing a new utility measured by the number of customers that have at least a specified minimum interest in the catalogs. a joint optimization approach is proposed in [***]<1> to address two issues in market segmentation, i. e influence:3 type:2 pair index:313 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. 
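the segmentation step above is driven by recency, frequency and monetary value. as a small illustration (the column names `customer_id`, `date` and `amount` are assumptions, not taken from the cited study), the rfm features can be derived from a transaction log as follows:

```python
import pandas as pd

def rfm_features(transactions, as_of):
    """Compute Recency / Frequency / Monetary-value features per customer from a
    transaction log with columns customer_id, date, amount."""
    tx = transactions.copy()
    tx["date"] = pd.to_datetime(tx["date"])
    grouped = tx.groupby("customer_id")
    rfm = pd.DataFrame({
        "recency_days": (pd.Timestamp(as_of) - grouped["date"].max()).dt.days,
        "frequency": grouped.size(),
        "monetary": grouped["amount"].sum(),
    })
    return rfm

# usage: rfm = rfm_features(tx_log, as_of="2007-01-01")
```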
categories and subject descriptors citee id:363 citee title:on robust and effective k-anonymity in large databases citee abstract:the challenge of privacy-preserving data mining lies in respecting privacy requirements while discovering the original interesting patterns or structures. existing methods loose the correlations among attributes by transforming the different attributes independently, or cannot guarantee the minimum ion level required by legal policies. in this paper, we propose a novel privacy-preserving transformation framework for distance-based mining operations based on the concept of privacy-preserving microclusters that satisfy a privacy constraint as well as a significance constraint. our framework well extends the robustness of the state-of-the-art k-anonymity model by introducing a privacy constraint (minimum radius) while keeping its effectiveness by a significance constraint (minimum number of corresponding data records). the privacy-preserving microclusters are made public for data mining purposes, but the original data records are kept private. we present efficient methods for generating and maintaining privacy-preserving microclusters and show that data mining operations such as clustering can easily be adapted to the public data represented by microclusters instead of the private data records. the experiment demonstrates that the proposed methods achieve accurate clusterings results while preserving the privacy. surrounding text:in the context of privacy preservation, again, it is typically unreasonable to specify the number of groups in advance. ppmicrocluster model [***]<3> requires both minimum significance and minimum radius constraints to preserve privacy. compared to this model, our constraint-driven clustering model adopts a more practical constraint, i influence:2 type:2 pair index:314 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. 
our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:131 citee title:a microeconomic view of data mining citee abstract:we present a rigorous framework, based on optimization, for evaluating data miningoperations such as associations and clustering, in terms of their utility in decisionmakingfithis framework leads quickly to some interesting computational problemsrelated to sensitivity analysis, segmentation and the theory of gamesfidepartment of computer science, cornell university, ithaca ny 14853fi email: kleinber@csficornellfiedufisupported in part by an alfred pfi sloan research fellowship and by nsf surrounding text:authors in [25]<2> present a clustering method on selforganizing sensor networks, for the purpose of grouping sensors into the optimal number of clusters that minimize the number of message transmissions. similar approaches on clustering in sensor networks also include [4, 8, ***]<2> etc. however, these clustering methods focus on dealing with engineering constraints instead of systematically studying the properties of the proposed clustering models influence:2 type:2 pair index:315 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. 
categories and subject descriptors citee id:364 citee title:efficient clustering algorithms for self-organizing wireless sensor networks citee abstract:: self-organization of wireless sensor networks, which involves network decomposition into connected clusters, is a challenging task because of the limited bandwidth and energy resources available in these networksfi in this paper, we make contributions towards improving the efficiency of self-organization in wireless sensor networksfi we first present a novel approach for message-efficient clustering, in which nodes allocate local "growth budgets" to neighborsfi we introduce two algorithms that surrounding text:[5]<2> proposed a randomized algorithm to find the optimal number of cluster heads by minimizing the total energy spent on communicating between sensors and the information-processing center through the cluster heads. authors in [***]<2> present a clustering method on selforganizing sensor networks, for the purpose of grouping sensors into the optimal number of clusters that minimize the number of message transmissions. similar approaches on clustering in sensor networks also include [4, 8, 23]<2> etc influence:2 type:2 pair index:316 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. 
categories and subject descriptors citee id:365 citee title:on the complexity of optimal k-anonymity citee abstract:the technique of k-anonymization has been proposed in the literature as an alternative way to release public information, while ensuring both data privacy and data integrity. we prove that two general versions of optimal k-anonymization of relations are np-hard, including the suppression version which amounts to choosing a minimum number of entries to delete from the relation. we also present a polynomial time algorithm for optimal k-anonymity that achieves an approximation ratio independent of the size of the database, when k is constant. in particular, it is a o(k log k)-approximation where the constant in the big-o is no more than 4. however, the runtime of the algorithm is exponential in k. a slightly more clever algorithm removes this condition, but is a o(k log m)-approximation, where m is the degree of the relation. we believe this algorithm could potentially be quite fast in practice. surrounding text:1 other records in the table. [***]<2> and [3]<2> prove that k-anonymity with suppression is np-hard and study approximation algorithms. the k-anonymity model is defined on categorical data, and thus has different properties from our model which assumes a geometric space with the euclidean distance influence:3 type:3 pair index:317 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:366 citee title:generalizing data to provide anonymity when disclosing information citee abstract:the proliferation of information on the internet and access to fast computers with large storage capacities has increased the volume of information collected and disseminated about individuals.
the existence of these data sources makes it much easier to re-identify individuals whose private information is released in data believed to be anonymous. at the same time, increasing demands are made on organizations to release individualized data rather than aggregate statistical information. even when explicit identifiers, such as name and phone number, are removed or encrypted when releasing individualized data, other characteristic data, which we term quasi-identifiers, can exist which allow the data recipient to re-identify individuals to whom the data refer. in this paper, we provide a computational disclosure technique for releasing information from a private table such that the identity of any individual to whom the released data refer cannot be definitively recognized. our approach protects against linking to other data. it is based on the concepts of generalization, by which stored values can be replaced with semantically consistent but less precise alternatives, and of k-anonymity. a table is said to provide k-anonymity when the contained data do not allow the recipient to associate the released information with a set of individuals smaller than k. we introduce the notions of generalized table and of minimal generalization of a table with respect to a k-anonymity requirement. as an optimization problem, the objective is to minimally distort the data while providing adequate protection. we describe an algorithm that, given a table, efficiently computes a preferred minimal generalization to provide anonymity. surrounding text:moreover, in this application, it is natural to have the constraints decide the appropriate number of clusters instead of specifying a number in advance. privacy preservation [***, 30]<3>: in a privacy preservation application, we may want to release personal records to the public without a privacy breach. to achieve this, we can group records into small clusters and release the summary of each cluster to the public influence:3 type:3 pair index:318 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. 
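the generalization-based k-anonymity notion above can be checked directly: after the quasi-identifiers have been generalized, every combination of quasi-identifier values must occur at least k times in the released table. a minimal pandas sketch; the quasi-identifier columns and the age-binning generalization step are assumptions chosen for illustration:

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True iff every combination of quasi-identifier values occurs at least k times."""
    return df.groupby(quasi_identifiers).size().min() >= k

def generalize_age(df, width=10):
    """One generalization step: replace exact ages by coarser intervals."""
    out = df.copy()
    out["age"] = (out["age"] // width) * width   # e.g. 37 -> 30, standing for [30, 40)
    return out

# usage sketch on a tiny hypothetical table
table = pd.DataFrame({"age": [34, 37, 52, 58],
                      "zip": ["5371*", "5371*", "5370*", "5370*"],
                      "disease": ["flu", "cold", "flu", "flu"]})
print(is_k_anonymous(table, ["age", "zip"], k=2))                  # False: exact ages
print(is_k_anonymous(generalize_age(table), ["age", "zip"], k=2))  # True after generalizing
```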
based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:160 citee title:a scalable approach to balanced, high-dimensional clustering of market-baskets citee abstract:this paper presents opossum, a novel similarity-based clustering approach based on constrained, weighted graph-partitioningfi opossum is particularly attuned to real-life market baskets, characterized by very high-dimensional, highly sparse customer-product matrices with positive ordinal attribute values and significant amount of outliersfi since it is built on top of metis, a well-known and highly efficient graph partitioning algorithm, it inherits the scalable and easily parallelizeable surrounding text:present an efficient three-step scheme [6, 7]<2> which gives a very general methodology for scaling up balanced clustering algorithms. the same problem is converted to a graph partition problem in [***]<2> to discover balanced clusterings. however, the complexity of the graph-based approach is higher than the one in [6]<2> influence:2 type:2 pair index:319 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:367 citee title:k-anonymity: a model for protecting privacy citee abstract:consider a data holder, such as a hospital or a bank, that has a privately held collection of person-specific, field structured datafi suppose the data holder wants to share a version of the data with researchersfi how can a data holder release a version of its private data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful? 
the solution provided in this paper includes a formal protection model named k-anonymity and a set of accompanying policies for deployment. a release provides k-anonymity protection if the information for each person contained in the release cannot be distinguished from at least k-1 individuals whose information also appears in the release. this paper also examines re-identification attacks that can be realized on releases that adhere to k-anonymity unless accompanying policies are respected. the k-anonymity protection model is important because it forms the basis on which the real-world systems known as datafly, µ-argus and k-similar provide guarantees of privacy protection. surrounding text:moreover, in this application, it is natural to have the constraints decide the appropriate number of clusters instead of specifying a number in advance. privacy preservation [28, ***]<3>: in a privacy preservation application, we may want to release personal records to the public without a privacy breach. to achieve this, we can group records into small clusters and release the summary of each cluster to the public. in this context, the usability of a clustering is evaluated by how much privacy is preserved in the clustering. to preserve individual privacy, the k-anonymity model [***]<3> requires that each cluster contain at least a certain number of individuals. however, these individuals could have very similar, even identical attribute values, allowing an adversary to accurately estimate their sensitive attribute values with high confidence from the summary influence:2 type:2 pair index:320 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm.
categories and subject descriptors citee id:353 citee title:constraint-based clustering in large databases citee abstract:constrained clustering, finding clusters that satisfy user-specified constraints, is highly desirable in many applications. in this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. a scalable constraint-clustering algorithm is developed in this study which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints. our algorithm consists of two phases: pivot movement and deadlock resolution. for both phases, we show that finding the optimal solution is np-hard. we then propose several heuristics and show how our algorithm can scale up for large data sets using the heuristic of micro-cluster sharing. by experiments, we show the effectiveness and efficiency of the heuristics. surrounding text:for example, in market segmentation, relatively balanced customer groups are more preferable so that the knowledge extracted from each group has equal significance and is thus easier to evaluate [19]<2>. the special requirement of identifying balanced clusters can be effectively captured by imposing balancing constraints [9, ***, 7, 35]<2>. these models enable us to find useful clusters. cluster-level constraints. the research on clustering with constraints was introduced by [9]<2> and systematically studied in [***]<2>. both clustering models aim at partitioning data points into k clusters while each cluster satisfies a significance constraint. tung et al. [***]<2> propose to solve this problem by starting with any valid clustering. the solution is repeatedly refined by moving some objects between clusters to reduce the clustering cost while maintaining constraint satisfaction influence:1 type:2 pair index:321 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function.
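the cdc abstract above names two constraint types, minimum significance and minimum variance. a minimal sketch of how one candidate cluster could be tested against such constraints follows; the thresholds and the variance definition (mean squared distance to the centroid) are illustrative assumptions, not the paper's exact formulation.

# Minimal sketch: test whether a candidate cluster satisfies CDC-style
# constraints as described above: a minimum significance constraint
# (at least `min_size` points) and a minimum variance constraint
# (sample variance of at least `min_var`). Thresholds are illustrative.
import numpy as np

def satisfies_cdc_constraints(points, min_size=3, min_var=0.5):
    points = np.asarray(points, dtype=float)
    if len(points) < min_size:
        return False                      # violates minimum significance
    centroid = points.mean(axis=0)
    variance = np.mean(np.sum((points - centroid) ** 2, axis=1))
    return variance >= min_var            # otherwise violates minimum variance

cluster = [[0.0, 0.0], [1.0, 0.5], [0.5, 1.2], [1.4, 1.1]]
print(satisfies_cdc_constraints(cluster))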
based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:306 citee title:clustering with instance-level constraints citee abstract:clustering algorithms conduct a search through the space of possible organizations of a data set. in this paper, we propose two types of instance-level clustering constraints (must-link and cannot-link constraints) and show how they can be incorporated into a clustering algorithm to aid that search. for three of the four data sets tested, our results indicate that the incorporation of surprisingly few such constraints can increase clustering accuracy while decreasing runtime. we also investigate the relative effects of each type of constraint and find that the type that contributes most to accuracy improvements depends on the behavior of the clustering algorithm without constraints. surrounding text:instance-level constraints. work on instance-level constraints and clustering was first presented to the machine learning and data mining communities by wagstaff and cardie [***]<2>. in this line of work the constraints are on instances either forcing them to be in the same cluster (must-link) or different clusters (cannot-link) influence:1 type:2 pair index:322 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:368 citee title:drawing planar graphs citee abstract:the book presents the important fundamental theorems and algorithms on planar graph drawing with easy-to-understand and constructive proofs.
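for the instance-level constraints described above, a small sketch of the basic feasibility check, i.e. whether a given assignment honors all must-link and cannot-link pairs, may be useful; the labels and constraint pairs below are invented for illustration.

# Minimal sketch of the instance-level constraints described above:
# given a cluster assignment, verify that every must-link pair shares a
# cluster and every cannot-link pair is separated.
def satisfies_constraints(labels, must_link, cannot_link):
    ok_must = all(labels[i] == labels[j] for i, j in must_link)
    ok_cannot = all(labels[i] != labels[j] for i, j in cannot_link)
    return ok_must and ok_cannot

labels = {0: "A", 1: "A", 2: "B", 3: "B"}
print(satisfies_constraints(labels, must_link=[(0, 1)], cannot_link=[(1, 2)]))  # True
print(satisfies_constraints(labels, must_link=[(0, 2)], cannot_link=[]))        # False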
extensively illustrated and with exercises included at the end of each chapter, it is suitable for use in advanced undergraduate and graduate level courses on algorithms, graph theory, graph drawing, information visualization and computational geometry. the book will also serve as a useful reference source for researchers in the field of graph drawing and software developers in information visualization, vlsi design and cad surrounding text:the resulting layout is denoted as l. [***]<3> proposed a linear time algorithm to compute such a layout. a similar rectilinear layout is used in [13]<2> to prove the np-completeness of the feasibility problem for the must-link and  constraints influence:3 type:3 pair index:323 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aim at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:283 citee title:birch: an efficient data clustering method for very large databases citee abstract:finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. prior work does not adequately address the problem of large datasets and minimization of i/o costs. this paper presents a data clustering method named birch (balanced iterative reducing and clustering using hierarchies), and demonstrates that it is surrounding text:thus, easily retrieving the local neighborhood of data points is critical to the design of a universal algorithm that can handle different constraints. guided by these observations, we propose an algorithm based on a novel data structure, called the cd-tree, which is similar to the b-tree and cf-tree [***]<3>.
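since the cd-tree is described above as similar to the cf-tree of birch, a minimal sketch of the kind of sufficient statistics a cf-tree leaf maintains (the triple n, ls, ss) may help; this is a generic birch-style summary, not the cd-tree's actual node layout.

# Minimal sketch of a BIRCH-style clustering feature: the triple (N, LS, SS)
# is enough to compute a centroid and radius incrementally without storing points.
import numpy as np

class ClusteringFeature:
    def __init__(self, dim):
        self.n = 0
        self.ls = np.zeros(dim)   # linear sum of the points
        self.ss = 0.0             # sum of squared norms of the points

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # root of the average squared distance of members to the centroid
        c = self.centroid()
        return np.sqrt(max(self.ss / self.n - float(c @ c), 0.0))

cf = ClusteringFeature(dim=2)
for p in [[1.0, 2.0], [2.0, 2.0], [1.5, 1.0]]:
    cf.add(p)
print(cf.centroid(), cf.radius())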
4 influence:3 type:1 pair index:324 citer id:354 citer title:constraint-driven clustering citer abstract:clustering methods can be either data-driven or need-driven. data-driven methods intend to discover the true structure of the underlying data while need-driven methods aims at organizing the true structure to meet certain application requirements. thus, need-driven (e.g. constrained) clustering is able to find more useful and actionable clusters in applications such as energy aware sensor networks, privacy preservation, and market segmentation. however, the existing methods of constrained clustering require users to provide the number of clusters, which is often unknown in advance, but has a crucial impact on the clustering result. in this paper, we argue that a more natural way to generate actionable clusters is to let the application-specific constraints decide the number of clusters. for this purpose, we introduce a novel cluster model, constraint-driven clustering (cdc), which finds an a priori unspecified number of compact clusters that satisfy all user-provided constraints. two general types of constraints are considered, i.e. minimum significance constraints and minimum variance constraints, as well as combinations of these two types. we prove the np-hardness of the cdc problem with different constraints. we propose a novel dynamic data structure, the cd-tree, which organizes data points in leaf nodes such that each leaf node approximately satisfies the cdc constraints and minimizes the objective function. based on cd-trees, we develop an efficient algorithm to solve the new clustering problem. our experimental evaluation on synthetic and real datasets demonstrates the quality of the generated clusters and the scalability of the algorithm. categories and subject descriptors citee id:369 citee title:scalable, balanced model-based clustering citee abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering surrounding text:for example, in market segmentation, relatively balanced customer groups are more preferable so that the knowledge extracted from each group has equal significance and are thus easier to evaluate [19]<2>. the special requirement of identifying balanced clusters can be effectively captured by imposing balancing constraints [9, 31, 7, ***]<2>. these models enable us to find useful clusters influence:3 type:2 pair index:325 citer id:355 citer title:max-min d-cluster formation in wireless ad hoc networks citer abstract:an ad hoc network may be logically represented as a set of clusters. the clusterheads form a d-hop dominating set. each node is at most d hops from a clusterhead. 
clusterheads form a virtual backbone and may be used to route packets for nodes in their cluster. previous heuristics restricted themselves to 1-hop clusters. we show that the minimum d-hop dominating set problem is np-complete. then we present a heuristic to form d-clusters in a wireless ad hoc network. nodes are assumed to have non-deterministic mobility pattern. clusters are formed by diffusing node identities along the wireless links. when the heuristic terminates, a node either becomes a clusterhead, or is at most d wireless hops away from its clusterhead. the value of d is a parameter of the heuristic. the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evently distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:712 citee title:the architectural organization of a mobile radio network via a distributed algorithm citee abstract:in this paper we consider the problem of organizing a set of mobile, radio-equipped nodes into a connected network. we require that a reliable structure be acquired and maintained in the face of arbitrary topological changes due to node motion and/or failure. we also require that such a structure be achieved without the use of a central controller. we propose and develop a self-starting, distributed algorithm that establishes and maintains such a connected architecture. this algorithm is especially suited to the needs of the hf intra-task force (itf) communication network, which is discussed in the paper surrounding text:thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca [***]<2> and degree based [11]<2> solutions. i. introduction ad hoc networks (also referred to as packet radio networks) consist of nodes that move freely and communicate with other nodes via wireless links. one way to support efficient communication between nodes is to develop a wireless backbone architecture [***]<2>, [2]<2>, [4]<2>, [8]<2>. while all nodes are identical in their capabilities, certain nodes are elected to form the backbone. due to the mobility of nodes in an ad hoc network, the backbone must be continuously reconstructed in a timely fashion, as the nodes move away from their associated clusterheads. the election of clusterheads has been a topic of many papers as described in [***]<2>, [2]<2>, [8]<2>. in all of these papers the leader election guarantees that no node will be more than one hop away from a leader. additionally, the heuristic elects clusterheads in such a manner as to favor their re-election in future rounds, thereby reducing transition overheads when old clusterheads give way to new clusterheads, however, it is also fair as a large number of nodes equally share the responsibility for acting as clusterheads. 
furthermore, this heuristic has time complexity of o(d) rounds which compares favorably to o(n) for earlier heuristics [***]<2>, [4]<2> for large mobile networks. this reduction in time complexity is obtained by increasing the concurrency in communication. typically, backbones are constructed to connect neighborhoods in the network. past solutions of this kind have created a hierarchy where every node in the network was no more than 1 hop away from a clusterhead [***]<2>, [4]<2>, [10]<2>. in large networks this approach may generate a large number of clusterheads and eventually lead to the same problem as stated in the first design approach. furthermore, some of the previous clustering solutions have relied on synchronous clocks for exchange of data between nodes. in the linked cluster algorithm [***]<2>, lca, nodes communicate using tdma frames. each frame has a slot for each node in the network to communicate, avoiding collisions. therefore, a node must realize not only if it is the largest in its d-neighborhood but also if it is the largest in any other node’s d-neighborhood. this is similar to the strategy employed in [***]<2>. the second stage uses d rounds of floodmin to propagate the smaller node ids that have not been overtaken. thus, the storage complexity is o(d). this compares favorably with heuristics like [***]<2>, [2]<2> where identities of all neighboring nodes are maintained and the storage complexity is o(n). viii. simulation experiments and results we conducted simulation experiments to evaluate the performance of the proposed heuristic and compare these findings against three heuristics, the original linked cluster algorithm (lca) [***]<2>, the revised linked cluster algorithm (lca2) [5]<2>, and the highest-connectivity (degree) [11]<2>, [8]<2> heuristic. we assumed a variety of systems running with 100, 200, 400, and 600 nodes to simulate ad hoc networks with varying levels of node density influence:1 type:2 pair index:326 citer id:355 citer title:max-min d-cluster formation in wireless ad hoc networks citer abstract:an ad hoc network may be logically represented as a set of clusters. the clusterheads form a d-hop dominating set. each node is at most d hops from a clusterhead. clusterheads form a virtual backbone and may be used to route packets for nodes in their cluster. previous heuristics restricted themselves to 1-hop clusters. we show that the minimum d-hop dominating set problem is np-complete.
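the excerpt above describes the two diffusion stages of the max-min heuristic, d rounds of floodmax followed by d rounds of floodmin. a minimal synchronous sketch of those stages follows; the toy topology is illustrative, and the full heuristic additionally keeps the per-round winners that its clusterhead and gateway rules inspect.

# Minimal sketch of the two diffusion stages: d synchronous rounds in which
# each node adopts the largest id heard so far (floodmax), followed by d
# rounds of the symmetric minimum diffusion (floodmin).
def flood(adj, values, rounds, pick):
    for _ in range(rounds):
        new = {}
        for v in adj:
            new[v] = pick([values[v]] + [values[u] for u in adj[v]])
        values = new
    return values

def max_min_stages(adj, d):
    winner = flood(adj, {v: v for v in adj}, d, max)   # floodmax stage
    winner = flood(adj, winner, d, min)                # floodmin stage
    return winner

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}  # a 5-node path
print(max_min_stages(adj, d=2))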
simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:5 citee title:a better heuristic for orthogonal graph drawing citee abstract:an orthogonal drawing of a graph is an embedding in the plane such that all edges are drawn as sequences of horizontal and vertical segments. we present a linear time and space algorithm to draw any connected graph orthogonally on a grid of size n × n with at most 2n + 2 bends. each edge is bent at most twice. in particular for non-planar and non-biconnected planar graphs, this is a big improvement. the algorithm is very simple, easy to implement, and it handles both planar and non-planar surrounding text:to this end, we make use of the following result which shows how planar graphs can be efficiently embedded into the euclidean plane [16]<1>: a planar graph with maximum degree 4 can be embedded in the plane using o(|v|) area in such a way that its vertices are at integer coordinates and its edges are drawn so that they are made up of line segments of form x=i or y=j, for integers i and j. moreover, according to [***]<1> such embeddings can be constructed in linear time. thus, in constructing our reduction we may assume that we are given a graph g = (v
the primary routes for packets and flows are still computed by a shortest-paths computation; the virtual backbone can, if necessary provide backup routes to handle interim failures. because of the dynamic nature of the virtual backbone, our approach splits the routing problem into two levels: (a) find and update the virtual backbone, and (b) then find and update routes. the key contribution of this paper is to describe several alternatives for the first part of finding and updating the virtual backbone. to keep the virtual backbone as small as possible we use an approximation to the minimum connected dominating set (mcds) of the ad-hoc network topology as the virtual backbone. the hosts in the mcds maintain local copies of the global topology of the network, along with shortest paths between all pairs of nodes surrounding text:introduction ad hoc networks (also referred to as packet radio networks) consist of nodes that move freely and communicate with other nodes via wireless links. one way to support efficient communication between nodes is to develop a wireless backbone architecture [1]<2>, [2]<2>, [***]<2>, [8]<2>. while all nodes are identical in their capabilities, certain nodes are elected to form the backbone. additionally, the heuristic elects clusterheads in such a manner as to favor their re-election in future rounds, thereby reducing transition overheads when old clusterheads give way to new clusterheads, however, it is also fair as a large number of nodes equally share the responsibility for acting as clusterheads. furthermore, this heuristic has time complexity of o(d) rounds which compares favorably to o(n) for earlier heuristics [1]<2>, [***]<2> for large mobile networks. this reduction in time complexity is obtained by increasing the concurrency in communication. typically, backbones are constructed to connect neighborhoods in the network. past solutions of this kind have created a hierarchy where every node in the network was no more than 1 hop away from a clusterhead [1]<2>, [***]<2>, [10]<2>. in large networks this approach may generate a large number of clusterheads and eventually lead to the same problem as stated in the first design approach influence:2 type:2 pair index:328 citer id:355 citer title:max-min d-cluster formation in wireless ad hoc networks citer abstract:an ad hoc network may be logically represented as a set of clusters. the clusterheads form a d-hop dominating set. each node is at most d hops from a clusterhead. clusterheads form a virtual backbone and may be used to route packets for nodes in their cluster. previous heuristics restricted themselves to 1-hop clusters. we show that the minimum d-hop dominating set problem is np-complete. then we present a heuristic to form d-clusters in a wireless ad hoc network. nodes are assumed to have non-deterministic mobility pattern. clusters are formed by diffusing node identities along the wireless links. when the heuristic terminates, a node either becomes a clusterhead, or is at most d wireless hops away from its clusterhead. the value of d is a parameter of the heuristic. the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. 
also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evenly distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:31 citee title:a design concept for reliable mobile radio networks with frequency hopping signaling citee abstract:a new architecture for mobile radio networks, called the linked cluster architecture, is described, and methods for implementing this architecture using distributed control techniques are presented. we illustrate how fully distributed control methods can be combined with hierarchical control to create a network that is robust with respect to both node loss and connectivity changes. two distributed algorithms are presented that deal with the formation and linkage of clusters and the activation of the network links. to study the performance of our network structuring algorithms, a simulation model was developed. the use of simulation to construct software simulation tools is illustrated. simulation results are shown for the example of a high frequency (hf) intra-task force (itf) communication network surrounding text:a node x becomes a clusterhead if at least one of the following conditions is satisfied: (i) x has the highest identity among all nodes within 1 wireless hop of it, (ii) x does not have the highest identity in its 1-hop neighborhood, but there exists at least one neighboring node y such that x is the highest identity node in y’s 1-hop neighborhood. later the lca heuristic was revised [***]<2> to decrease the number of clusterheads produced in the original lca. in this revised edition of lca (lca2) a node is said to be covered if it is in the 1-hop neighborhood of a node that has declared itself to be a clusterhead. viii. simulation experiments and results we conducted simulation experiments to evaluate the performance of the proposed heuristic and compare these findings against three heuristics, the original linked cluster algorithm (lca) [1]<2>, the revised linked cluster algorithm (lca2) [***]<2>, and the highest-connectivity (degree) [11]<2>, [8]<2> heuristic. we assumed a variety of systems running with 100, 200, 400, and 600 nodes to simulate ad hoc networks with varying levels of node density
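the lca election rule quoted above (conditions (i) and (ii)) can be written down compactly; the following sketch assumes a static adjacency list, which is a simplification of the tdma-based protocol.

# Minimal sketch of the LCA clusterhead rule: x becomes a clusterhead if it has
# the highest id in its own 1-hop neighborhood, or if it is the highest-id node
# in the 1-hop neighborhood of some neighbor y.
def lca_clusterheads(adj):
    # adj maps each node id to the set of its 1-hop neighbors
    def highest_heard_by(v):
        return max([v] + list(adj[v]))
    heads = set()
    for x in adj:
        if highest_heard_by(x) == x:                         # condition (i)
            heads.add(x)
        elif any(highest_heard_by(y) == x for y in adj[x]):  # condition (ii)
            heads.add(x)
    return heads

adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(sorted(lca_clusterheads(adj)))  # [2, 3, 4]: 4 by (i); 2 and 3 by (ii)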
one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evently distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:702 citee title:maca-bi (maca by invitation) citee abstract:this paper introduces a new wireless mac protocol, maca-bi (maca by invitation). the protocol is a simplified version of the well known maca (multiple access collision avoidance) based on the request to send/clear to send (rts/cts) handshake. the clear to send (cts) control message is retained, while the request to send (rts) part of the rts/cts handshake is suppressed. maca-bi, preserving the data collision free property, is more robust than maca to problems such as protocol failures surrounding text:maca utilizes a request to send/clear to send (rts/cts) handshaking to avoid collision between nodes. a modified maca protocol, maca-bi (by invitation) [***]<2>, suppresses all rts and relies solely on cts, invitations to transmit data. simulation experiments show maca-bi to be superior to maca and csma in multi-hop networks. the time complexity and the number of transmissions required to achieve a local broadcast (to all neighbors) for a single round is dependent on the success of the data link layer protocol. while the maca-bi has been shown to be superior to maca and csma [***]<2> it still suffers from the hidden terminal problem and may require re-transmissions to complete a round. a data link protocol similar to maca-bi that resolves completely the hidden terminal problem is an area of additional research and not the intent of this paper influence:3 type:2 pair index:330 citer id:355 citer title:max-min d-cluster formation in wireless ad hoc networks citer abstract:an ad hoc network may be logically represented as a set of clusters. the clusterheads form a d-hop dominating set. each node is at most d hops from a clusterhead. clusterheads form a virtual backbone and may be used to route packets for nodes in their cluster. previous heuristics restricted themselves to 1-hop clusters. we show that the minimum d-hop dominating set problem is np-complete. then we present a heuristic to form d-clusters in a wireless ad hoc network. nodes are assumed to have non-deterministic mobility pattern. clusters are formed by diffusing node identities along the wireless links. when the heuristic terminates, a node either becomes a clusterhead, or is at most d wireless hops away from its clusterhead. the value of d is a parameter of the heuristic. the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evently distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. 
simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:440 citee title:distributed algorithms for generating loop-free routes in networks with frequently changing topology citee abstract:we consider the problem of maintaining communication between the nodes of a data network and a central station in the presence of frequent topological changes as, for example, in mobile packet radio networks. we argue that flooding schemes have significant drawbacks for such networks, and propose a general class of distributed algorithms for establishing new loop-free routes to the station for any node left without a route due to changes in the network topology. by virtue of built-in redundancy, the algorithms are typically activated very infrequently and, even when they are, they do not involve any communication within the portion of the network that has not been materially affected by a topological change. surrounding text:previous work and design choices there are two heuristic design approaches for management of ad hoc networks. the first choice is to have all nodes maintain knowledge of the network and manage themselves [***]<2>, [12]<2>, [13]<2>. this circumvents the need to select leaders or develop clusters
simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:714 citee title:multicluster, mobile, multimedia radio network citee abstract:in this paper, we present a multi-cluster, multi-hop packet radio network architecture that addresses the above challenges and implements all required features. in a mobile, multi-channel (code), and multi-hop environment, the topology is dynamically reconfigured to handle mobility. routing and bandwidth assignment are designed so as to meet the various types of traffic requirements. node clustering, vc setup and channel access control are the underlying features which support this architecture, surrounding text:introduction ad hoc networks (also referred to as packet radio networks) consist of nodes that move freely and communicate with other nodes via wireless links. one way to support efficient communication between nodes is to develop a wireless backbone architecture [1]<2>, [2]<2>, [4]<2>, [***]<2>. while all nodes are identical in their capabilities, certain nodes are elected to form the backbone. due to the mobility of nodes in an ad hoc network, the backbone must be continuously reconstructed in a timely fashion, as the nodes move away from their associated clusterheads. the election of clusterheads has been a topic of many papers as described in [1]<2>, [2]<2>, [***]<2>. in all of these papers the leader election guarantees that no node will be more than one hop away from a leader. in the case of a tie, the lowest or highest id may be used. as the network topology changes this approach can result in a high turnover of clusterheads [***]<2>. this is undesirable due to the high overhead associated with clusterhead changeover. viii. simulation experiments and results we conducted simulation experiments to evaluate the performance of the proposed heuristic and compare these findings against three heuristics, the original linked cluster algorithm (lca) [1]<2>, the revised linked cluster algorithm (lca2) [5]<2>, and the highest-connectivity (degree) [11]<2>, [***]<2> heuristic. we assumed a variety of systems running with 100, 200, 400, and 600 nodes to simulate ad hoc networks with varying levels of node density. this is not surprising for degree as it is based on degree of connectivity, not node id. as the network topology changes this approach can result in high turnover of clusterheads [***]<2>. similarly, in lca and lca2 a single link make or break may move a lower id node within or out of d-hops of a node x, forcing it to transition between clusterhead and normal node states influence:1 type:2 pair index:332 citer id:355 citer title:max-min d-cluster formation in wireless ad hoc networks citer abstract:an ad hoc network may be logically represented as a set of clusters. the clusterheads form a d-hop dominating set. each node is at most d hops from a clusterhead. clusterheads form a virtual backbone and may be used to route packets for nodes in their cluster. previous heuristics restricted themselves to 1-hop clusters. we show that the minimum d-hop dominating set problem is np-complete. then we present a heuristic to form d-clusters in a wireless ad hoc network. nodes are assumed to have a non-deterministic mobility pattern. clusters are formed by diffusing node identities along the wireless links. when the heuristic terminates, a node either becomes a clusterhead, or is at most d wireless hops away from its clusterhead.
the value of d is a parameter of the heuristic. the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evenly distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:239 citee title:approximation algorithms for combinatorial problems citee abstract:simple, polynomial-time, heuristic algorithms for finding approximate solutions to various polynomial complete optimization problems are analyzed with respect to their worst case behavior, measured by the ratio of the worst solution value that can be chosen by the algorithm to the optimal value. for certain problems, such as a simple form of the knapsack problem and an optimization problem based on satisfiability testing, there are algorithms for which this ratio is bounded by a constant, independent of the problem size. for a number of set covering problems, simple algorithms yield worst case ratios which can grow with the log of the problem size. and for the problem of finding the maximum clique in a graph, no algorithm has been found for which the ratio does not grow at least as fast as o(n^ε), where n is the problem size and ε > 0 depends on the algorithm. surrounding text:proof: since it is obvious that the minimum d-hop dominating set problem is in np, it remains to show that it is np-hard. we will construct a reduction from the (1-hop) dominating set problem for planar graphs with maximum degree 3 which was shown to be np-complete in [***]<3>. to this end, we make use of the following result which shows how planar graphs can be efficiently embedded into the euclidean plane [16]<1>: a planar graph with maximum degree 4 can be embedded in the plane using o(|v|) area in such a way that its vertices are at integer coordinates and its edges are drawn so that they are made up of line segments of form x=i or y=j, for integers i and j
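the reduction sketched above embeds a planar graph on an integer grid and then works with a unit disk graph over the resulting points. a minimal sketch of the unit disk graph construction itself is shown below; the point set is illustrative, and the actual reduction also places auxiliary points along the rectilinear edges of the embedding.

# Minimal sketch: build the unit disk graph over 2-D points, where two points
# are adjacent exactly when their Euclidean distance is at most the radius.
from itertools import combinations
from math import dist

def unit_disk_graph(points, radius=1.0):
    return [(p, q) for p, q in combinations(points, 2) if dist(p, q) <= radius]

points = [(0, 0), (1, 0), (2, 0), (2, 1)]
print(unit_disk_graph(points))  # [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((2, 0), (2, 1))]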
one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evently distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:715 citee title:spatial reuse in multihop packet radio networks citee abstract:multihop packet radio networks present many challenging problems to the network analyst and designer. the communication channel, which must be shared by all of the network users, is the critical system resource. in order to make efficient use of this shared resource, a variety of channel access protocols to promote organized sharing have been investigated. sharing can occur in three domains: frequency, time, and space. this paper is mostly concerned with sharing and channel reuse in the spatial domain. a survey of results on approaches to topological design and associated channel access protocols that attempt to optimize system performance by spatial reuse of the communication channel is presented. surrounding text:simulation experiments show maca-bi to be superior to maca and csma in multi-hop networks. other protocols such as spatial tdma [***]<2> may be used to provide mac layer communication. spatial tdma provides deterministic performance that is good if the number of nodes is kept relatively small. typically, backbones are constructed to connect neighborhoods in the network. past solutions of this kind have created a hierarchy where every node in the network was no more than 1 hop away from a clusterhead [1]<2>, [4]<2>, [***]<2>. in large networks this approach may generate a large number of clusterheads and eventually lead to the same problem as stated in the first design approach influence:3 type:2 pair index:334 citer id:355 citer title:max-min d-cluster formation in wireless ad hoc networks citer abstract:an ad hoc network may be logically represented as a set of clusters. the clusterheads form a d-hop dominating set. each node is at most d hops from a clusterhead. clusterheads form a virtual backbone and may be used to route packets for nodes in their cluster. previous heuristics restricted themselves to 1-hop clusters. we show that the minimum d-hop dominating set problem is np-complete. then we present a heuristic to form d-clusters in a wireless ad hoc network. nodes are assumed to have non-deterministic mobility pattern. clusters are formed by diffusing node identities along the wireless links. when the heuristic terminates, a node either becomes a clusterhead, or is at most d wireless hops away from its clusterhead. the value of d is a parameter of the heuristic. the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evently distribute the responsibility of acting as clusterheads among all nodes. 
thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca [1]<2> and degree based [***]<2> solutions. i. additionally, it has been shown [15]<2> that as communications increase the amount of skew in a synchronous timer also increases, thereby degrading the performance of the overall system or introducing additional delay and overhead. other solutions base the election of clusterheads on degree of connectivity [***]<2>, not node id. each node broadcasts the nodes that it can hear, including itself. viii. simulation experiments and results we conducted simulation experiments to evaluate the performance of the proposed heuristic and compare these findings against three heuristics, the original linked cluster algorithm (lca) [1]<2>, the revised linked cluster algorithm (lca2) [5]<2>, and the highest-connectivity (degree) [***]<2>, [8]<2> heuristic. we assumed a variety of systems running with 100, 200, 400, and 600 nodes to simulate ad hoc networks with varying levels of node density citee id:716 citee title:selecting routers in ad-hoc wireless networks citee abstract:with the advent of portable laptops and personal digital assistants it is becoming increasingly important to enable communication over large networks with mobile users. while a cellular framework can go a long way in meeting this need it has also become clear that it is necessary to support ad-hoc architectures in which fixed base stations are not present. routing in an ad hoc network is supported by the client devices of the network, i.e. some of them have to act as routers. this is essential since the range of transmitters may be quite limited, less than m for ir links supporting mb/s. the ieee committee on wireless local area networking has recognized ad hoc architectures as an important component of their standardization process. in this paper we model an ad hoc network as an undirected graph. the nodes of the graph are processors that communicate along their incident edges by broadcast. the processors do not know the size of the network and start out with no topological information. our goal is to select a small subset of the nodes of the graph that can act as routers for the ad hoc network. clearly the set of routers must be such that every other node is adjacent to at least one node in the subset. such a subset in graph theoretic terminology is called a dominating set. finding the smallest dominating set of a graph is known to be np-hard. we present a fast distributed algorithm that finds a dominating set that is of size at most n ?? pm for any network with n nodes and m links and that is provably close to the minimum cardinality set.
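for the degree-based (highest-connectivity) election described above, the following sketch assumes each node already knows which nodes it can hear; the tie-break by node id is one of the options mentioned in the text, chosen here only for illustration.

# Minimal sketch of the degree-based election: within each 1-hop neighborhood
# the node of highest degree wins, with node id as an illustrative tie-breaker.
def degree_based_clusterheads(adj):
    degree = {v: len(adj[v]) for v in adj}
    heads = set()
    for x in adj:
        # the winner in x's neighborhood: highest degree, then highest id
        winner = max([x] + list(adj[x]), key=lambda v: (degree[v], v))
        heads.add(winner)
    return heads

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2, 5}, 5: {4}}
print(sorted(degree_based_clusterheads(adj)))  # [2, 4]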
then we present a heuristic to form d-clusters in a wireless ad hoc network. nodes are assumed to have a non-deterministic mobility pattern. clusters are formed by diffusing node identities along the wireless links. when the heuristic terminates, a node either becomes a clusterhead, or is at most d wireless hops away from its clusterhead. the value of d is a parameter of the heuristic. the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evenly distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:94 citee title:a highly adaptive distributed routing algorithm for mobile wireless networks citee abstract:we present a new distributed routing protocol for mobile, multihop, wireless networks. the protocol is one of a family of protocols which we term "link reversal" algorithms. the protocol's reaction is structured as a temporally-ordered sequence of diffusing computations; each computation consisting of a sequence of directed link reversals. the protocol is highly adaptive, efficient and scalable; being best-suited for use in large, dense, mobile networks. in these networks, the protocol's reaction to link failures typically involves only a localized "single pass" of the distributed algorithm. this capability is unique among protocols which are stable in the face of network partitions, and results in the protocol's high degree of adaptivity. this desirable behavior is achieved through the novel use of a "physical or logical clock" to establish the "temporal order" of topological change events which is used to structure (or order) the algorithm's reaction to topological changes. we refer to the protocol as the temporally-ordered routing algorithm (tora) surrounding text:previous work and design choices there are two heuristic design approaches for management of ad hoc networks. the first choice is to have all nodes maintain knowledge of the network and manage themselves [7]<2>, [***]<2>, [13]<2>. this circumvents the need to select leaders or develop clusters
the heuristic can be run either at regular intervals, or whenever the network configuration changes. one of the features of the heuristic is that it tends to re-elect existing clusterheads even when the network configuration changes. this helps to reduce the communication overheads during transition from old clusterheads to new clusterheads. also, there is a tendency to evenly distribute the mobile nodes among the clusterheads, and evenly distribute the responsibility of acting as clusterheads among all nodes. thus, the heuristic is fair and stable. simulation experiments demonstrate that the proposed heuristic is better than the two earlier heuristics, namely the lca and degree based solutions citee id:717 citee title:universality considerations in vlsi circuits citee abstract:the problem of embedding the interconnection pattern of a circuit into a two-dimensional surface of minimal area is discussed. since even for some natural patterns graphs containing m connections may require omega(m^2) area, in order to achieve compact embeddings restricted classes of graphs have to be considered. for example, arbitrary trees (of bounded degree) can be embedded in linear area without edges crossing over. planar graphs can be embedded efficiently only if crossovers are allowed in the embedding. surrounding text:we will construct a reduction from the (1-hop) dominating set problem for planar graphs with maximum degree 3 which was shown to be np-complete in [9]<3>. to this end, we make use of the following result which shows how planar graphs can be efficiently embedded into the euclidean plane [***]<1>: a planar graph with maximum degree 4 can be embedded in the plane using o(|v|) area in such a way that its vertices are at integer coordinates and its edges are drawn so that they are made up of line segments of form x=i or y=j, for integers i and j. moreover, according to [3]<1> such embeddings can be constructed in linear time. thus, in constructing our reduction we may assume that we are given a graph g = (v, e) that is embedded in the plane according to [***]<1>. we construct in polynomial time a unit disk graph g' = (v'
typically, under realistic circumstances, the encoding and communication of messages have to deal with different sources of noise and disturbances. in this paper, we propose a unifying approach to data compression by robust vector quantization, which explicitly deals with channel noise, bandwidth limitations, and random elimination of prototypes. the resulting algorithm is able to limit the detrimental effect of noise in a very general communication scenario. in addition, the presented model allows us to derive a novel competitive neural networks algorithm, which covers topology preserving feature maps, the so-called neural-gas algorithm, and the maximum entropy soft-max rule as special cases. furthermore, continuation methods based on these noise models improve the codebook design by reducing the sensitivity to local minima. we show an exemplary application of the novel robust vector quantization algorithm to image compression for a teleconferencing system. surrounding text:the most notable type of clustering algorithms in terms of balanced solutions is graph partitioning [20]<2>, but it needs o(n^2) computation just to compute the similarity matrix. certain online approaches such as frequency sensitive competitive learning [***]<2> can also be employed for improving balancing. a generative model based on a mixture of von mises-fisher distributions has been developed to characterize such approaches for normalized data [3]<2> influence:2 type:2 pair index:338 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:572 citee title:frequency sensitive competitive learning for clustering on high-dimensional hyperspheres citee abstract:this paper derives three competitive learning mechanisms from first principles to obtain clusters of comparable sizes when both inputs and representatives are normalized. these mechanisms are very effective in achieving balanced grouping of inputs in high dimensional spaces, as illustrated by experimental results on clustering two popular text data sets in 26,099 and 21,839 dimensional spaces respectively. surrounding text:certain online approaches such as frequency sensitive competitive learning [1]<2> can also be employed for improving balancing. a generative model based on a mixture of von mises-fisher distributions has been developed to characterize such approaches for normalized data [***]<2>. the problem of clustering large scale data under constraints such as balancing has recently received attention in the data mining literature [31, 7, 35, 4]<2>.
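an illustrative sketch of the frequency sensitive competitive learning idea mentioned above: the distortion to each prototype is scaled by how often that prototype has already won, so over-used clusters become less attractive and sizes stay comparable. the win-count scaling, learning rate and toy data are assumptions, not the cited mechanisms verbatim.

```python
import numpy as np

def fscl(X, k, epochs=20, lr=0.05, seed=0):
    """Frequency sensitive competitive learning (sketch): distance scaled by win counts."""
    rng = np.random.default_rng(seed)
    protos = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    wins = np.ones(k)                              # one pseudo-win each to start
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            d2 = ((protos - x) ** 2).sum(axis=1)
            j = int(np.argmin(wins * d2))          # frequency-sensitive winner
            wins[j] += 1
            protos[j] += lr * (x - protos[j])      # move winner toward the sample
    labels = np.argmin(((X[:, None, :] - protos[None]) ** 2).sum(-1), axis=1)
    return protos, labels

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, size=(100, 2))
               for m in ([0, 0], [3, 0], [0, 3])])
_, labels = fscl(X, k=3)
print(np.bincount(labels))                         # roughly balanced cluster sizes expected
```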
(3.12). k2: k-vmfs. euclidean distance is not appropriate for clustering high dimensional normalized data such as text [34]<2>. a better metric used for text clustering is the cosine similarity, which can be derived from directional statistics, the von mises-fisher distribution [***]<2>. the general vmf distribution can be written as p(o|λ) = c_d(κ) exp(κ μᵀo) influence:2 type:2 pair index:339 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:743 citee title:model-based gaussian and non-gaussian clustering citee abstract:the classification maximum likelihood approach is sufficiently general to encompass many current clustering algorithms, including those based on the sum of squares criterion and on the criterion of friedman and rubin (1967). however, as currently implemented, it does not allow the specification of which features (orientation, size and shape) are to be common to all clusters and which may differ between clusters. also, it is restricted to gaussian distributions and it does not allow for noise. we propose ways of overcoming these limitations. a reparameterization of the covariance matrix allows us to specify that some features, but not all, be the same for all clusters. a practical framework for non-gaussian clustering is outlined, and a means of incorporating noise in the form of a poisson process is described. an approximate bayesian method for choosing the number of clusters is given. the performance of the proposed methods is studied by simulation, with encouraging results. the methods are applied to the analysis of a data set arising in the study of diabetes, and the results seem better than those of previous analyses. keywords include: bayes factors; classification; diabetes; hierarchical agglomeration; iterative relocation; mixture models surrounding text:there are several motivations behind our approach. first, probabilistic model-based clustering provides a principled and general approach to clustering [***]<1>. for example, the number of clusters may be estimated using bayesian model selection, though this is not done in this paper influence:2 type:1 pair index:340 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment.
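a small sketch of the k-vmfs flavour discussed above: with a shared concentration κ, the maximum-likelihood assignment under von mises-fisher components reduces to picking the mean direction with the highest cosine similarity (spherical k-means). the initialization and fixed iteration count below are simplifying assumptions.

```python
import numpy as np

def spherical_kmeans(X, k, iters=20, seed=0):
    """Cosine-similarity (vMF-style ML) assignment with mean-direction re-estimation."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmax(X @ mu.T, axis=1)           # highest cosine similarity wins
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                mu[j] = m / np.linalg.norm(m)           # re-estimate mean direction
    return labels, mu
```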
instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:90 citee title:a gentle tutorial of the em algorithm and its application to parameter estimation for gaussian mixture and hidden markov models citee abstract:we describe the maximum-likelihood parameter estimation problem and how the expectation-maximization (em) algorithm can be used for its solution. we first describe the form of the em algorithm as it is often given in the literature. we then develop the em parameter estimation procedure for two applications: 1) finding the parameters of a mixture of gaussian densities, and 2) finding the parameters of a hidden markov model (hmm) (i.e., the baum-welch algorithm) for both discrete and surrounding text:keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering 1 introduction clustering or segmentation of data is a fundamental data analysis step that has been widely studied across multiple disciplines for over 40 years [15]<1>. current clustering methods can be divided into generative (model-based) approaches [29, ***, 27]<1> and discriminative (similarity-based) approaches [36, 28, 14]<1>. parametric, model-based approaches attempt to learn generative models from the data, with each model corresponding to one particular cluster influence:3 type:2 pair index:341 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:349 citee title:constrained k-means clustering citee abstract:we consider practical methods for adding constraints to the k-means clustering algorithm in order to avoid local solutions with empty clusters or clusters having very few points. we often observe this phenomenon when applying k-means to datasets where the number of dimensions is n ≥ 10 and the number of desired clusters is k ≥ 20. we propose explicitly adding k constraints to the underlying clustering optimization problem requiring that each cluster have at least a minimum number of points in it. we then investigate the resulting cluster assignment step.
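a minimal em sketch for a gaussian mixture, following the generic e-step / m-step structure of the tutorial cited above; the spherical covariance and fixed iteration count are simplifying assumptions.

```python
import numpy as np

def em_gmm(X, k, iters=50, seed=0):
    """EM for a spherical Gaussian mixture: E-step responsibilities, M-step re-estimation."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=k, replace=False)]
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: posterior responsibilities p(cluster j | x_i), computed in log space
        log_p = (-0.5 * ((X[:, None, :] - mu[None]) ** 2).sum(-1) / var
                 - 0.5 * d * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture weights, means and spherical variances
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        var = (resp * ((X[:, None, :] - mu[None]) ** 2).sum(-1)).sum(0) / (d * nk)
    return resp.argmax(axis=1), mu
```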
preliminary numerical tests on real datasets indicate the constrained approach is less prone to poor local solutions, producing a better summary of the underlying data. constrained k-means clustering surrounding text:unfortunately, k-means type algorithms (including the soft em variant [6]<2>, and birch [37]<2>) are increasingly prone to yielding imbalanced solutions as the input dimensionality increases. this problem is exacerbated when a large (tens or more) number of clusters are needed, and it is well known that both hard and soft k-means invariably result in some near-empty clusters in such scenarios [4, ***]<2>. while not an explicit goal in most clustering formulations, certain approaches such as top-down bisecting k-means [30]<2> tend to give more balanced solutions than others such as single-link agglomerative clustering. a generative model based on a mixture of von mises-fisher distributions has been developed to characterize such approaches for normalized data [3]<2>. the problem of clustering large scale data under constraints such as balancing has recently received attention in the data mining literature [31, ***, 35, 4]<2>. since balancing is a global property, it is difficult to obtain near-linear time techniques to achieve this goal while retaining high cluster quality. this algorithm has relatively low complexity of o(n log(n)) but relies on the assumption that the data itself is very balanced (for the sampling step to work). bradley, bennett and demiriz [***]<2> developed a constrained version of the k-means algorithm. they constrained each cluster to be assigned at least a minimum number of data samples at each iteration. (3.10). it is an integer programming problem, which is np-hard in general. fortunately, this integer programming problem is special in that it has the same optimum as its corresponding real relaxation [***]<2>. algorithm: iterative greedy bipartitioning. input: log-likelihood matrix w_ij = log p(o_i|λ_j). similar to [***]<2>, we can assign just the top m samples at each iteration and use ml assignment for the remaining data. this variation may be useful in situations where "near" balanced clustering is desired, but is not investigated in this paper influence:1 type:2 pair index:342 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:308 citee title:co-clustering documents and words using bipartite spectral graph partitioning citee abstract:both document clustering and word clustering are well studied problems. most existing algorithms cluster documents and words separately but not simultaneously.
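an illustrative balance-constrained assignment step for the setting above: given the log-likelihood matrix w[i, j] = log p(o_i | λ_j), assign each sample to its best still-open cluster subject to a per-cluster capacity. this greedy capacity rule is a stand-in for intuition only, not the paper's iterative greedy bipartitioning heuristic or the constrained k-means formulation.

```python
import numpy as np

def balanced_assign(w):
    """Greedy capacity-constrained assignment from a log-likelihood matrix w (n x k)."""
    n, k = w.shape
    cap = int(np.ceil(n / k))                     # roughly equal cluster sizes
    counts = np.zeros(k, dtype=int)
    labels = np.full(n, -1)
    # visit samples with the largest best-vs-second-best margin first, so that
    # confident samples grab their preferred cluster before it fills up
    srt = np.sort(w, axis=1)
    margin = srt[:, -1] - srt[:, -2]
    for i in np.argsort(-margin):
        for j in np.argsort(-w[i]):               # clusters in decreasing likelihood
            if counts[j] < cap:
                labels[i] = j
                counts[j] += 1
                break
    return labels

w = np.log(np.random.default_rng(0).dirichlet(np.ones(4), size=100))
print(np.bincount(balanced_assign(w)))            # sizes of 25 each (up to rounding)
```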
in this paper we present the novel idea of modeling the document collection as a bipartite graph between documents and words, using which the simultaneous clustering problem can be posed as a bipartite graph partitioning problem. to solve the partitioning problem, we use a new spectral co-clustering algorithm that uses the second left and right singular vectors of an appropriately scaled word-document matrix to yield good bipartitionings. the spectral algorithm enjoys some optimality properties; it can be shown that the singular vectors solve a real relaxation to the np-complete graph bipartitioning problem. we present experimental results to verify that the resulting co-clustering algorithm works well in practice surrounding text:in addition to application requirements, balanced clustering is sometimes also helpful because it tends to decrease sensitivity to initialization and to avoid outlier clusters (highly under-utilized representatives) from forming, and thus has a beneficial regularizing effect. in fact, balance is also an important constraint for spectral graph partitioning algorithms [***, 17, 24]<2>, which could give completely useless results if the objective function is just the minimum cut instead of a modified minimum cut that favors balanced clusters. unfortunately, k-means type algorithms (including the soft em variant [6]<2>, and birch [37]<2>) are increasingly prone to yielding imbalanced solutions as the input dimensionality increases influence:3 type:2 pair index:343 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:391 citee title:cure: an efficient clustering algorithm for large databases citee abstract:clustering, in data mining, is useful for discovering groups and identifying interesting distributions in the underlying data. traditional clustering algorithms either favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. we propose a new clustering algorithm called cure that is more robust to outliers, and identifies clusters having non-spherical shapes and wide variances in size. cure achieves this by representing each cluster by a certain fixed number of points that are generated by selecting well scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. having more than one representative point per cluster allows cure to adjust well to the geometry of non-spherical shapes and the shrinking helps to dampen the effects of outliers.
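a sketch of the bipartite spectral bipartitioning idea summarized earlier in this record: scale the document-word matrix by the square roots of its row and column sums, take the second singular vector pair of the scaled matrix, and split documents and words by sign on that embedding. using a plain sign split (rather than a k-means step on the embedding) and the tiny coupled block matrix are simplifying assumptions.

```python
import numpy as np

def spectral_cocluster(A, eps=1e-12):
    """2-way co-clustering of rows (documents) and columns (words) of a count matrix A."""
    d1 = 1.0 / np.sqrt(A.sum(axis=1) + eps)        # document (row) scaling
    d2 = 1.0 / np.sqrt(A.sum(axis=0) + eps)        # word (column) scaling
    An = d1[:, None] * A * d2[None, :]
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    doc_side = d1 * U[:, 1]                        # second left singular vector
    word_side = d2 * Vt[1]                         # second right singular vector
    return (doc_side > 0).astype(int), (word_side > 0).astype(int)

A = np.array([[3.0, 2.0, 0.1, 0.0],                # two doc groups, two word groups,
              [2.0, 3.0, 0.0, 0.1],                # weakly coupled so the graph is connected
              [0.1, 0.0, 4.0, 1.0],
              [0.0, 0.1, 1.0, 4.0]])
docs, words = spectral_cocluster(A)
print(docs, words)                                 # e.g. [0 0 1 1] [0 0 1 1], up to a sign flip
```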
to handle large databases, cure employs a combination of random sampling and partitioning. a random sample drawn from the data set is first partitioned and each partition is partially clustered. the partial clusters are then clustered in a second pass to yield the desired clusters. our experimental results confirm that the quality of clusters produced by cure is much better than those found by existing algorithms. furthermore, they demonstrate that random sampling and partitioning enable cure to not only outperform existing algorithms but also to scale well for large databases without sacrificing clustering quality. surrounding text:unfortunately, calculating the similarities between all pairs of data samples is computationally inefficient, requiring o(n^2) time. so sampling techniques have been typically employed to scale such methods to large datasets [***, 4]<2>. in contrast, several model-based partitional approaches have a complexity of o(kn), where k is the number of clusters, and are thus more scalable influence:3 type:2 pair index:344 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:185 citee title:a sublinear-time approximation scheme for clustering in metric spaces citee abstract:the metric 2-clustering problem is defined as follows: given a metric (x, d), partition x into two sets s1 and s2 in order to minimize the value of Σ_i Σ_{{u,v} ⊆ s_i} d(u,v). in this paper we show an approximation scheme for this problem surrounding text:keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering 1 introduction clustering or segmentation of data is a fundamental data analysis step that has been widely studied across multiple disciplines for over 40 years [15]<1>. current clustering methods can be divided into generative (model-based) approaches [29, 6, 27]<1> and discriminative (similarity-based) approaches [36, 28, ***]<1>. parametric, model-based approaches attempt to learn generative models from the data, with each model corresponding to one particular cluster influence:3 type:2 pair index:345 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment.
instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:582 citee title:generalized clustering, supervised learning, and data assignment citee abstract:clustering algorithms have become increasingly important in handling and analyzing data. considerable work has been done in devising effective but increasingly specific clustering algorithms. in contrast, we have developed a generalized framework that accommodates diverse clustering algorithms in a systematic way. this framework views clustering as a general process of iterative optimization that includes modules for supervised learning and instance assignment. the framework has also suggested surrounding text:second, the two-step view of partitional clustering is natural and has been discussed by kalton, wagstaff, and yoo [***]<1>. (figure 1: a bipartite graph view of model-based clustering.) finally, the post-processing refinement is motivated by the observation in [7]<1> that if the balancing constraint is too strict, then often distant data points are forcibly grouped together leading to substantial degradation in cluster quality influence:2 type:2 pair index:346 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:772 citee title:on clusterings: good, bad and spectral citee abstract:we motivate and develop a natural bicriteria measure for assessing the quality of a clustering that avoids the drawbacks of existing measures. a simple recursive heuristic is shown to have polylogarithmic worst-case guarantees under the new measure. the main result of the article is the analysis of a popular spectral algorithm. one variant of spectral clustering turns out to have effective worst-case guarantees; another finds a "good" clustering, if one exists. surrounding text:in addition to application requirements, balanced clustering is sometimes also helpful because it tends to decrease sensitivity to initialization and to avoid outlier clusters (highly under-utilized representatives) from forming, and thus has a beneficial regularizing effect.
in fact, balance is also an important constraint for spectral graph partitioning algorithms [8, ***, 24]<2>, which could give completely useless results if the objective function is just the minimum cut instead of a modified minimum cut that favors balanced clusters. unfortunately, k-means type algorithms (including the soft em variant [6]<2>, and birch [37]<2>) are increasingly prone to yielding imbalanced solutions as the input dimensionality increases influence:3 type:2 pair index:347 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:307 citee title:cluto - a clutering toolkit citee abstract:clustering algorithms divide data into meaningful or useful groups, called clusters, such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized. these discovered clusters can be used to explain the characteristics of the underlying data distribution and thus serve as the foundation for various data mining and analysis techniques. the applications of clustering include characterization of different customer groups based upon purchasing patterns, categorization of documents on the world wide web, grouping of genes and proteins that have similar functionality, grouping of spatial locations prone to earth quakes from seismological data, etc. cluto is a software package for clustering low and high dimensional datasets and for analyzing the characteristics of the various clusters surrounding text:we first tested the balanced k-means algorithm on a synthetic but difficult dataset—the t4 dataset (fig. 5(c)) included in the cluto toolkit [***]<3>. there are no ground truth labels for this dataset but there are six natural clusters plus a lot of noise according to human judgment. there are no ground truth labels for this dataset but there are six natural clusters plus a lot of noise according to human judgment. the best algorithm that can identify all the six natural clusters uses a hybrid partitional-hierarchical approach [19, ***]<3>. it partitions the data into a large number of clusters and then merges them back to a proper granularity level. note that both ng20 and mini20 datasets contain 20 completely balanced clusters. the classic dataset is contained in the datasets for the cluto software package [***]<3> and has been used in [38]<3>. 
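a small illustration of why balance is built into the graph partitioning objective discussed at the top of this record: a plain minimum cut can prefer peeling off a single weakly attached vertex, while a ratio-cut style objective divides the cut by the part sizes and so penalizes such degenerate, unbalanced splits. the toy graph and the ratio-cut form used here are assumptions for illustration.

```python
import numpy as np

def cut_and_ratio_cut(W, labels):
    """Return (cut weight, ratio cut = cut/|A| + cut/|B|) for a 2-way partition."""
    labels = np.asarray(labels)
    mask = labels == 0
    cut = W[mask][:, ~mask].sum()                 # total edge weight crossing the split
    sizes = np.array([mask.sum(), (~mask).sum()])
    return cut, cut * (1.0 / sizes).sum()

# two 3-node cliques joined by a bridge, plus one weakly attached pendant vertex
W = np.zeros((7, 7))
W[:3, :3] = 1.0; W[3:6, 3:6] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.5                           # bridge between the cliques
W[5, 6] = W[6, 5] = 0.4                           # pendant vertex

peel = [0, 0, 0, 0, 0, 0, 1]                      # split off the pendant vertex
balanced = [0, 0, 0, 1, 1, 1, 1]                  # split along the bridge
print(cut_and_ratio_cut(W, peel))                 # cut 0.4, ratio cut ~0.47
print(cut_and_ratio_cut(W, balanced))             # cut 0.5, ratio cut ~0.29 (preferred)
```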
it was obtained by combining the cacm, cisi, cranfield, and medline abstracts that were used in the past to evaluate various information retrieval systems3 influence:3 type:3 pair index:348 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:293 citee title:chameleon: hierarchical clustering using dynamic modeling citee abstract:clustering is a discovery process in data mining. it groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model. by basing its selections on both interconnectivity and closeness, the chameleon algorithm yields accurate results for these highly variable clusters. existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. furthermore, one set of schemes (the cure algorithm and related schemes) ignores the information about the aggregate interconnectivity of items in two clusters. another set of schemes (the rock algorithm, group averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters. by considering either interconnectivity or closeness only, these algorithms can select and merge the wrong pair of clusters. chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters. chameleon finds the clusters in the data set by using a two-phase algorithm. during the first phase, chameleon uses a graph partitioning algorithm to cluster the data items into several relatively small subclusters. during the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these subclusters surrounding text:there are no ground truth labels for this dataset but there are six natural clusters plus a lot of noise according to human judgment. the best algorithm that can identify all the six natural clusters uses a hybrid partitional-hierarchical approach [***, 18]<3>. it partitions the data into a large number of clusters and then merges them back to a proper granularity level influence:3 type:2 pair index:349 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. 
partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:52 citee title:a fast and high quality multilevel scheme for partitioning irregular graphs citee abstract:recently, a number of researchers have investigated a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. from the early work it was clear that multilevel techniques held great promise; however, it was not known if they can be made to consistently produce high quality partitions for graphs arising in a wide range of application domains. we investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. in particular, we present a new coarsening heuristic (called heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. we also present a much faster variation of the kernighan-lin (kl) algorithm for refining during uncoarsening. we test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, vlsi, and transportation. our experiments show that our scheme produces partitions that are consistently better than those produced by spectral partitioning schemes in substantially smaller time. also, when our scheme is used to compute fill-reducing orderings for sparse matrices, it produces orderings that have substantially smaller fill than the widely used multiple minimum degree algorithm surrounding text:while not an explicit goal in most clustering formulations, certain approaches such as top-down bisecting k-means [30]<2> tend to give more balanced solutions than others such as single-link agglomerative clustering. the most notable type of clustering algorithms in terms of balanced solutions is graph partitioning [***]<2>, but it needs o(n^2) computation just to compute the similarity matrix. certain online approaches such as frequency sensitive competitive learning [1]<2> can also be employed for improving balancing influence:3 type:2 pair index:350 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step.
an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:226 citee title:an information-theoretic analysis of hard and soft assignment methods for clustering citee abstract:assignment methods are at the heart of many algorithms for unsupervised learning and clustering --- in particular, the well-known k-means and expectation-maximization (em) algorithms. in this work, we study several different methods of assignment, including the "hard" assignments used by k-means and the "soft" assignments used by em. while it is known that k-means minimizes the distortion on the data and em maximizes the likelihood, little is known about the systematic differences of behavior surrounding text:..., and each model is trained using the posterior probability weighted samples. an information-theoretic analysis of these two assignment strategies has been given in [***]<1>. let us analyze these two algorithms from the perspective of objective function and explain why they usually perform very similarly in practice influence:3 type:2 pair index:351 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:288 citee title:bow: a toolkit for statistical language modeling, text retrieval, classification and clustering citee abstract:bow (or libbow) is a library of c code useful for writing statistical text analysis, language modeling and information retrieval programs. the current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow). surrounding text:the ng20 dataset is a collection of 20,000 messages, collected from 20 different usenet newsgroups, 1,000 messages from each. we preprocessed the raw dataset using the bow toolkit [***]<3>, including chopping off headers and removing stop words as well as words that occur in less than three documents.
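a generic two-step loop in the spirit of the hard versus soft assignment discussion above: an assignment step followed by a model re-estimation step, with `hard=True` giving the maximum-likelihood (argmax) assignment and `hard=False` the posterior-weighted, em-style assignment. the unit-variance spherical gaussian component model is only a placeholder assumption.

```python
import numpy as np

def two_step_clustering(X, k, hard=True, iters=30, seed=0):
    """Iterate (assignment step, model re-estimation step) with hard or soft assignment."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # log p(o_i | lambda_j) up to a constant, for unit-variance Gaussian components
        logp = -0.5 * ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        if hard:
            resp = np.zeros_like(logp)
            resp[np.arange(len(X)), logp.argmax(axis=1)] = 1.0   # ML assignment
        else:
            resp = np.exp(logp - logp.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)              # posterior weights
        mu = (resp.T @ X) / (resp.sum(axis=0)[:, None] + 1e-12)  # re-estimate models
    return logp.argmax(axis=1), mu
```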
in the resulting dataset, each document is represented by a 43,586-dimensional sparse vector and there are a total of 19,949 documents (with empty documents removed, still around 1,000 per category) influence:3 type:3 pair index:352 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:838 citee title:using unlabeled data to improve text classification citee abstract:one key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accuratelyfi this dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiersfi by assuming that documents are created by a parametric generative model, expectation-maximization (em) finds local maximum a posteriori models and classifiers from all the surrounding text:after the same preprocessing step, the resulting dataset consists of 1,998 documents in 10,633 dimensional vector space. this dataset has been used by nigam [***]<3> for text classification and is included in this paper to evaluate the performance of different models on small document collections. note that both ng20 and mini20 datasets contain 20 completely balanced clusters influence:3 type:3 pair index:353 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:189 citee title:a tutorial on hidden markov models and selected applications in speech recognition citee abstract:this tutorial provides an overview of the basic theory of hidden markov models (hmms) as originated by l.e. baum and t. 
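a scikit-learn stand-in for the preprocessing step described above (the original work used the bow toolkit): drop english stop words and terms appearing in too few documents, producing one sparse count vector per document. the tiny corpus is an assumption; on a real corpus the min_df=3 threshold mentioned in the text would be used.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["graphics card drivers crashed again",
        "the hockey team won in overtime",
        "new graphics drivers fix the crash",
        "hockey playoffs start next week"]
# stop-word removal plus a minimum document-frequency cutoff (use min_df=3 on real data)
vectorizer = CountVectorizer(stop_words="english", min_df=1)
X = vectorizer.fit_transform(docs)            # documents x vocabulary, sparse counts
print(X.shape, X.nnz)                         # matrix size and number of non-zeros
```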
petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. the author first reviews the theory of discrete markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. the theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. three fundamental problems of hmms are noted and several practical techniques for solving these problems are given. the various types of hmms that have been studied, including ergodic as well as left-right models, are described surrounding text:(3.16). k-hmms: hidden markov models have long been used for modeling sequences [***]<1> and the k-hmms algorithm has been used by several authors to cluster time sequences [29, 22]<1>. the nll objective used for the k-hmms algorithm is nll_khmms = -Σ_i log p(o_i|λ_{y_i}) influence:3 type:3 pair index:354 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:412 citee title:deterministic annealing for clustering, compression, classification, regression, and related optimization problems citee abstract:this paper. let us place it within the neural network perspective, and particularly that of learning. the area of neural networks has greatly benefited from its unique position at the crossroads of several diverse scientific and engineering disciplines including statistics and probability theory, physics, biology, control and signal processing, information theory, complexity theory, and psychology (see ). neural networks have provided a fertile soil for the infusion (and occasionally surrounding text:keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering 1 introduction clustering or segmentation of data is a fundamental data analysis step that has been widely studied across multiple disciplines for over 40 years [15]<1>. current clustering methods can be divided into generative (model-based) approaches [29, 6, ***]<1> and discriminative (similarity-based) approaches [36, 28, 14]<1>.
parametric, model-based approaches attempt to learn generative models from the data, with each model corresponding to one particular cluster influence:2 type:2 pair index:355 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:302 citee title:clustering sequences with hidden markov models citee abstract:this paper discusses a probabilistic model-based approach to clustering sequences, using hidden markov models (hmms). the problem can be framed as a generalization of the standard mixture model approach to clustering in feature space. two primary issues are addressed. first, a novel parameter initialization procedure is proposed, and second, the more difficult problem of determining the number of clusters k, from the data, is investigated. experimental results indicate that the proposed surrounding text:keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering 1 introduction clustering or segmentation of data is a fundamental data analysis step that has been widely studied across multiple disciplines for over 40 years [15]<1>. current clustering methods can be divided into generative (model-based) approaches [***, 6, 27]<1> and discriminative (similarity-based) approaches [36, 28, 14]<1>. parametric, model-based approaches attempt to learn generative models from the data, with each model corresponding to one particular cluster. in this paper, we take a balance-constrained approach built upon the framework of probabilistic, model-based clustering [40]<1>. model based clustering is very general, and can be used to cluster a wide variety of data types, from vector data to variable length sequences of symbols or numbers [***]<1>. first, a unifying bipartite graph view is presented for model-based clustering. (3.16). k-hmms: hidden markov models have long been used for modeling sequences [26]<1> and the k-hmms algorithm has been used by several authors to cluster time sequences [***, 22]<1>. the nll objective used for the k-hmms algorithm is nll_khmms = -Σ_i log p(o_i|λ_{y_i}) influence:2 type:2 pair index:356 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step.
an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:13 citee title:a comparison of document clustering techniques citee abstract:this paper presents the results of an experimental study of some common document clustering techniques. in particular, we compare the two main approaches to document clustering, agglomerative hierarchical clustering and k-means. (for k-means we used a "standard" k-means algorithm and a variant of k-means, "bisecting" k-means.) hierarchical clustering is often portrayed as the better quality clustering approach, but is limited because of its quadratic time complexity. in contrast, k-means and its variants have a time complexity which is linear in the number of documents, but are thought to produce inferior clusters. sometimes k-means and agglomerative hierarchical approaches are combined so as to "get the best of both worlds." however, our results indicate that the bisecting k-means technique is better than the standard k-means approach and as good or better than the hierarchical approaches that we tested for a variety of cluster evaluation metrics. we propose an explanation for these results that is based on an analysis of the specifics of the clustering algorithms and the nature of document data surrounding text:this problem is exacerbated when a large (tens or more) number of clusters are needed, and it is well known that both hard and soft k-means invariably result in some near-empty clusters in such scenarios [4, 7]<2>. while not an explicit goal in most clustering formulations, certain approaches such as top-down bisecting k-means [***]<2> tend to give more balanced solutions than others such as single-link agglomerative clustering. the most notable type of clustering algorithms in terms of balanced solutions is graph partitioning [20]<2>, but it needs o(n^2) computation just to compute the similarity matrix influence:2 type:2 pair index:357 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series.
keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:160 citee title:a scalable approach to balanced, high-dimensional clustering of market-baskets citee abstract:this paper presents opossum, a novel similarity-based clustering approach based on constrained, weighted graph-partitioning. opossum is particularly attuned to real-life market baskets, characterized by very high-dimensional, highly sparse customer-product matrices with positive ordinal attribute values and significant amount of outliers. since it is built on top of metis, a well-known and highly efficient graph partitioning algorithm, it inherits the scalable and easily parallelizable surrounding text:a generative model based on a mixture of von mises-fisher distributions has been developed to characterize such approaches for normalized data [3]<2>. the problem of clustering large scale data under constraints such as balancing has recently received attention in the data mining literature [***, 7, 35, 4]<2>. since balancing is a global property, it is difficult to obtain near-linear time techniques to achieve this goal while retaining high cluster quality influence:1 type:2 pair index:358 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:297 citee title:cluster ensembles: a knowledge reuse framework for combining partitions citee abstract:this paper introduces the problem of combining multiple partitionings of a set of objects into a single consolidated clustering without accessing the features or algorithms that determined these partitionings. we first identify several application scenarios for the resultant `knowledge reuse' framework that we call cluster ensembles. the cluster ensemble problem is then formalized as a combinatorial optimization problem in terms of shared mutual information. in addition to a direct maximization approach, we propose three effective and efficient techniques for obtaining high-quality combiners (consensus functions). the first combiner induces a similarity measure from the partitionings and then reclusters the objects. the second combiner is based on hypergraph partitioning. the third one collapses groups of clusters into meta-clusters which then compete for each object to determine the combined clustering. due to the low computational costs of our techniques, it is quite feasible to use a supra-consensus function that evaluates all three approaches against the objective function and picks the best solution for a given situation.
we evaluate the effectiveness of cluster ensembles in three qualitatively different application scenarios: (i) where the original clusters were formed based on non-identical sets of features, (ii) where the original clustering algorithms worked on non-identical sets of objects, and (iii) where a common data-set is used and the main purpose of combining multiple clusterings is to improve the quality and robustness of the solution. promising results are obtained in all three situations for synthetic as well as real data-sets. surrounding text:this is a better measure than purity or entropy which are both biased towards high k solutions. for a more detailed discussion on the nmi measure, see [34, ***]<3>. influence:3 type:3 pair index:359 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balanceconstrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitraryshape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:835 citee title:relationship-based clustering and visualization for high-dimensional data mining citee abstract:in several real-life data-mining this paper proposes a relationship-based approach that alleviates both problems, side-stepping the "curse-of-dimensionality" issue by working in a suitable similarity space instead of the original high-dimensional attribute space. this intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. we apply efficient and scalable graph-partitioning-based surrounding text:..., can be allocated to each segment. in large retail chains, one often desires product categories/groupings of comparable importance, since subsequent decisions such as shelf/floor space allocation and product placement are influenced by the objective of allocating resources proportional to revenue or gross margins associated with the product groups [***]<2>. similarly, in clustering of a large corpus of documents to generate topic hierarchies, balancing greatly facilitates navigation by avoiding the generation of hierarchies that are highly skewed, with uneven depth in different parts of the "tree" hierarchy or having widely varying number of documents at the leaf nodes influence:3 type:2 pair index:360 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment.
instead of a maximum-likelihood (ml) assignment, a balance-constrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitrary-shape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:322 citee title:impact of similarity measures on web-page clustering citee abstract:clustering of web documents enables (semi-)automated categorization, and facilitates certain types of search. any clustering method has to embed the documents in a suitable similarity space. while several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different surrounding text:(3.12) k2: k-vmfs euclidean distance is not appropriate for clustering high dimensional normalized data such as text [***]<2>. a better metric used for text clustering is the cosine similarity, which can be derived from directional statistics - the von mises-fisher distribution [3]<2>. (3.17) for k-means, k-vmfs, k-multinomials and k-hmms, respectively. for datasets that come with original class labels, we also evaluate the quality of clustering using the normalized mutual information [***]<2>, which is defined as nmi = sum_{h,l} n_{h,l} influence:3 type:2 pair index:361 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balance-constrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitrary-shape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:347 citee title:constrained clustering on large database citee abstract:constrained clustering - finding clusters that satisfy user-specified constraints - is highly desirable in many applications. in this paper, we introduce the constrained clustering problem and show that traditional clustering algorithms (e.g., k-means) cannot handle it. a scalable constraint-clustering algorithm is developed in this study which starts by finding an initial solution that satisfies user-specified constraints and then refines the solution by performing confined object movements under constraints. our algorithm consists of two phases: pivot movement and deadlock resolution. for both phases, we show that finding the optimal solution is np-hard.
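as a concrete illustration of the nmi evaluation measure referred to in the surrounding text above (whose formula is truncated in this record), the short python sketch below computes nmi between a clustering and reference class labels using the commonly used normalization by the square root of the product of the two label entropies; the exact normalization in the cited papers may differ, and the function and variable names here are illustrative, not taken from the cited work.

import numpy as np

def normalized_mutual_information(labels_true, labels_pred):
    """illustrative nmi between two labelings, normalized by
    sqrt(h(true) * h(pred)); the cited papers may normalize differently."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)
    mi = 0.0
    for h in classes:
        for l in clusters:
            n_hl = np.sum((labels_true == h) & (labels_pred == l))  # contingency count n_{h,l}
            if n_hl == 0:
                continue
            n_h = np.sum(labels_true == h)
            n_l = np.sum(labels_pred == l)
            mi += (n_hl / n) * np.log(n * n_hl / (n_h * n_l))
    # entropies of the two labelings
    p_h = np.array([np.sum(labels_true == h) / n for h in classes])
    p_l = np.array([np.sum(labels_pred == l) / n for l in clusters])
    h_true = -np.sum(p_h * np.log(p_h))
    h_pred = -np.sum(p_l * np.log(p_l))
    denom = np.sqrt(h_true * h_pred)
    return mi / denom if denom > 0 else 0.0

# example: a clustering that matches the reference labels up to renaming gives nmi = 1
print(normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

with this normalization, nmi is 1 for a clustering identical to the reference labels up to a renaming of the clusters, and tends towards 0 for unrelated labelings.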
we then propose several heuristics and show how our algorithm can scale up for large data sets using the heuristic of micro-cluster sharing. by experiments, we show the effectiveness and efficiency of the heuristics. surrounding text:a generative model based on a mixture of von mises-fisher distributions has been developed to characterize such approaches for normalized data [3]<2>. the problem of clustering large scale data under constraints such as balancing has recently received attention in the data mining literature [31, 7, ***, 4]<2>. since balancing is a global property, it is difficult to obtain near-linear time techniques to achieve this goal while retaining high cluster quality influence:1 type:2 pair index:362 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balance-constrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitrary-shape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:283 citee title:birch: an efficient data clustering method for very large databases citee abstract:finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. prior work does not adequately address the problem of large datasets and minimization of i/o costs. this paper presents a data clustering method named birch (balanced iterative reducing and clustering using hierarchies), and demonstrates that it is surrounding text:in fact, balance is also an important constraint for spectral graph partitioning algorithms [8, 17, 24]<2>, which could give completely useless results if the objective function is just the minimum cut instead of a modified minimum cut that favors balanced clusters. unfortunately, k-means type algorithms (including the soft em variant [6]<2>, and birch [***]<2>) are increasingly prone to yielding imbalanced solutions as the input dimensionality increases. this problem is exacerbated when a large (tens or more) number of clusters are needed, and it is well known that both hard and soft k-means invariably result in some near-empty clusters in such scenarios [4, 7]<2> influence:3 type:2 pair index:363 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment.
instead of a maximum-likelihood (ml) assignment, a balance-constrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitrary-shape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:390 citee title:criterion functions for document clustering: experiments and analysis citee abstract:in recent years, we have witnessed a tremendous growth in the volume of text documents available on the internet, digital libraries, news sources, and company-wide intranets. this has led to an increased interest in developing methods that can help users to effectively navigate, summarize, and organize this information with the ultimate goal of helping them to find what they are looking for. fast and high-quality document clustering algorithms play an important role towards this goal as they have been shown to provide both an intuitive navigation/browsing mechanism by organizing large amounts of information into a small number of meaningful clusters as well as to greatly improve the retrieval performance either via cluster-driven dimensionality reduction, term-weighting, or query expansion. this ever-increasing importance of document clustering and the expanded range of its applications led to the development of a number of new and novel algorithms with different complexity-quality trade-offs. among them, a class of clustering algorithms that have relatively low computational requirements are those that treat the clustering problem as an optimization process which seeks to maximize or minimize a particular clustering criterion function defined over the entire clustering solution. the focus of this paper is to evaluate the performance of different criterion functions for the problem of clustering documents. our study involves a total of seven different criterion functions, three of which are introduced in this paper and four that have been proposed in the past. our evaluation consists of both a comprehensive experimental evaluation involving fifteen different datasets, as well as an analysis of the characteristics of the various criterion functions and their effect on the clusters they produce. our experimental results show that there are a set of criterion functions that consistently outperform the rest, and that some of the newly proposed criterion functions lead to the best overall results. our theoretical analysis of the criterion function shows that their relative performance depends on (i) the degree to which they can correctly operate when the clusters are of different tightness, and (ii) the degree to which they can lead to reasonably balanced clusters. surrounding text:note that both ng20 and mini20 datasets contain 20 completely balanced clusters. the classic dataset is contained in the datasets for the cluto software package [18]<3> and has been used in [***]<3>.
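the citer abstract above describes replacing the maximum-likelihood assignment step with a balance-constrained one; the python sketch below shows, under stated assumptions, what a simple capacity-constrained assignment step could look like. it is a greedy illustration only, not the iterative bipartitioning heuristic developed in the cited paper, and all names (balanced_assignment, log_likelihood, capacity) are made up for this example.

import numpy as np

def balanced_assignment(log_likelihood, capacity=None):
    """greedy, capacity-constrained alternative to a maximum-likelihood
    assignment step; illustrative only, not the paper's heuristic.
    log_likelihood: (n, k) array of log p(x_i | cluster j)."""
    n, k = log_likelihood.shape
    if capacity is None:
        capacity = int(np.ceil(n / k))               # roughly equal-size clusters
    order = np.argsort(-log_likelihood.max(axis=1))  # most confident points first
    counts = np.zeros(k, dtype=int)
    assign = np.empty(n, dtype=int)
    for i in order:
        # best still-open cluster for this point
        for j in np.argsort(-log_likelihood[i]):
            if counts[j] < capacity:
                assign[i] = j
                counts[j] += 1
                break
    return assign

inside a model-based clustering loop, a step like this would replace the usual argmax over cluster likelihoods between successive model re-estimation passes.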
it was obtained by combining the cacm, cisi, cranfield, and medline abstracts that were used in the past to evaluate various information retrieval systems influence:3 type:3 pair index:364 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balance-constrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitrary-shape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:613 citee title:hmms and coupled hmms for multi-channel eeg classification citee abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models surrounding text:the data contains measurements, sampled at 256 hz for 1 second, from 64 electrodes placed on the scalp. we extracted from the archive a subset called eeg-2 [***]<3>, that contains 20 measurements for two subjects—one alcoholic and one control, for each of three stimulus types, from 2 electrodes (f4, p8). as a result, the eeg-2 dataset contains 120 data samples and each sample is a pair of sequences of length 256 influence:3 type:3 pair index:365 citer id:369 citer title:scalable, balanced model-based clustering citer abstract:this paper presents a general framework for adapting any generative (model-based) clustering algorithm to provide balanced solutions, i.e., clusters of comparable sizes. partitional, model-based clustering algorithms are viewed as an iterative two-step optimization process—iterative model re-estimation and sample re-assignment. instead of a maximum-likelihood (ml) assignment, a balance-constrained approach is used for the sample assignment step. an efficient iterative bipartitioning heuristic is developed to reduce the computational complexity of this step and make the balanced sample assignment algorithm scalable to large datasets. we demonstrate the superiority of this approach to regular ml clustering on complex data such as arbitrary-shape 2-d spatial data, high-dimensional text documents, and eeg time series. keywords: model-based clustering, scalable algorithms, balanced clustering, constrained clustering citee id:190 citee title:a unified framework for model-based clustering and its applications to clustering time sequences citee abstract:model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data.
this paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based clustering algorithms. in this view, clusters are represented as probabilistic models in a model space that is conceptually separate from the data space. for partitional clustering, the view is conceptually similar to the expectation-maximization (em) algorithm. for hierarchical clustering, the graph-based view helps to visualize critical/important distinctions between similarity-based approaches and model-based approaches. the framework also suggests several useful variations of existing clustering algorithms. two new variations—balanced model-based clustering and hybrid model-based clustering—are discussed and empirically evaluated on a variety of data types. surrounding text:the cluster assignment subproblem was formulated as a minimum cost flow problem, which has a high (o(n^3)) complexity. in this paper, we take a balance-constrained approach built upon the framework of probabilistic, model-based clustering [***]<1>. model-based clustering is very general, and can be used to cluster a wide variety of data types, from vector data to variable length sequences of symbols or numbers [29]<1>. the closeness measure is used for assigning samples to clusters in partitional clustering and the distance measure for finding the closest pair of clusters to merge in hierarchical agglomerative clustering. readers are referred to [***]<1> for a detailed description. the maximum-likelihood method is often used for training a model. j = 1/k for all j leads to a more constrained version of soft clustering [***]<1>. note that the general formula (2 influence:1 type:1 pair index:366 citer id:373 citer title:continuous time bayesian networks citer abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph. more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events. we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process citee id:374 citee title:event history analysis citee abstract:the purpose of event history analysis is to explain why certain individuals are at a higher risk of experiencing the event(s) of interest than others.
this can be accomplished by using special types of methods which, depending on the field in which they are applied, are called failure-time models, life-time models, survival models, transition-rate models, response-time models, event history models, duration models, or hazard models. examples of textbooks discussing this class of techniques are , , , , , , and . here, we will use the terms event history, survival, and hazard models interchangeably. a hazard model is a regression model in which the "risk" of experiencing an event at a certain time point is predicted with a set of covariates. two special features distinguish hazard models from other types of regression models. the first is that they make it possible to deal with censored observations, which contain only partial information on the timing of the event of interest. another special feature is that covariates may change their value during the observation period. the possibility of including such time-varying covariates makes it possible to perform a truly dynamic analysis. before discussing in more detail the most important types of hazard models, we will first introduce some basic concepts. surrounding text:anderson et al., 1993)[***]<2>[2]<2> and markov process models (duffie et al., 1996. even though the transition intensity for z depends only on the value of y at any instance in time, as soon as we consider temporal evolution, their states become correlated. this problem is completely analogous to the entanglement problem in dbns (boyen & koller, 1998)[***]<2>, where all variables in the dbn typically become correlated over some number of time slices. the primary difference is that, in continuous time, even the smallest time increment dt results in the same level of entanglement as we would gain from an arbitrary number of time slices in a dbn influence:3 type:2 pair index:367 citer id:373 citer title:continuous time bayesian networks citer abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph. more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events. we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process citee id:140 citee title:a model for reasoning about persistence and causation citee abstract:reasoning about change requires predicting how long a proposition, having become true, will continue to be so.
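since the event-history abstract above describes hazard models as regressions that predict the risk of an event from covariates while allowing for censored observations, here is a minimal, hypothetical generator for such data with a constant (exponential) hazard; the proportional-hazards form and all names are illustrative assumptions, not taken from the cited text.

import numpy as np

rng = np.random.default_rng(0)

def sample_event_time(x, beta, censor_time):
    """toy proportional-hazards style generator: constant hazard
    lambda = exp(beta . x); returns (observed_time, event_indicator),
    with event_indicator = 0 for a right-censored observation.
    names and values are illustrative."""
    hazard = np.exp(x @ beta)
    t = rng.exponential(1.0 / hazard)        # exponential survival time
    if t > censor_time:
        return censor_time, 0                # right-censored
    return t, 1

x = np.array([1.0, 0.5])                     # covariates for one individual
print(sample_event_time(x, beta=np.array([0.2, -0.4]), censor_time=3.0))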
lacking perfect knowledge, an agent may be constrained to believe that a proposition persists indefinitely simply because there is no way for the agent to infer a contravening proposition with certainty. in this paper, we describe a model of causal reasoning that accounts for knowledge concerning cause-and-effect relationships and knowledge concerning the tendency for propositions to persist or not as a function of time passing. our model has a natural encoding in the form of a network representation for probabilistic models. we consider the computational properties of our model by reviewing recent advances in computing the consequences of models encoded in this network representation. finally, we discuss how our probabilistic model addresses certain classical problems in temporal reasoning (e.g., the frame and qualification problems). surrounding text:with such a representation we can be explicit about the direct dependencies which are present and use the independencies to our advantage computationally; however, bayesian networks are designed to reason about static processes, and cannot be used directly to answer the types of questions that concern us here. dynamic bayesian networks (dbns) (dean & kanazawa, 1989)[***]<1> are the standard extension of bayesian networks to temporal processes. dbns model a dynamic system by discretizing time and providing a bayesian network fragment that represents the probabilistic transition of the state at time t to the state at time t + 1 influence:3 type:1 pair index:368 citer id:373 citer title:continuous time bayesian networks citer abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph. more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events. we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process citee id:376 citee title:recursive valuation of defaultable securities and the timing of resolution of uncertainty citee abstract:we derive the implications of default risk for valuation of securities in a setting in which the fractional default recovery rate and the hazard rate for default may depend on the market value of the instrument itself, or on the market values of other instruments issued by the same entity (which are determined simultaneously). a key technique is the use of backward recursive stochastic integral equations.
we characterize the dependence of the market value on the manner of resolution of uncertainty, and in particular give conditions for monotonicity of value with respect to the information filtration surrounding text:, 1996; lando, 1998)[***]<2>[10]<2>. work well, but do not allow the specification of models with a large structured state space where some variables do not directly depend on others influence:3 type:2 pair index:369 citer id:373 citer title:continuous time bayesian networks citer abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph. more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events. we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process citee id:377 citee title:probabilistic temporal reasoning with endogenous change citee abstract:this paper presents a probabilistic model for reasoning about the state of a system as it changes over time, both due to exogenous and endogenous influences. our target domain is a class of medical prediction problems that are neither so urgent as to preclude careful diagnosis nor progress so slowly as to allow arbitrary testing and treatment options. in these domains there is typically enough time to gather information about the patient's state and consider alternative diagnoses and surrounding text:hanks et al. (1995)[***]<2> present another discrete time approach to temporal reasoning related to dbns which they extend with a rule-based formalism to model endogenous changes to variables which occur between exogenous events. they also include an extensive discussion of various approaches to probabilistic temporal reasoning influence:2 type:2 pair index:370 citer id:373 citer title:continuous time bayesian networks citer abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph.
more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events. we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process citee id:378 citee title:on cox processes and credit risky securities citee abstract:a framework is presented for modeling defaultable securities and credit derivatives which allows for dependence between market risk factors and credit risk. the framework reduces the technical issues of modeling credit risk to the same issues faced when modeling the ordinary term structure of interest rates. it is shown how to generalize a model of jarrow, lando and turnbull (1997) to allow for stochastic transition intensities between rating categories and into default. this generalization can handle contracts with payments explicitly linked to ratings. it is also shown how to obtain a term structure model for all different rating categories simultaneously and how to obtain an affine-like structure. an implementation is given in a simple one factor model in which the affine structure gives closed form solutions. surrounding text:, 1996; lando, 1998)[5]<2>[***]<2>. work well, but do not allow the specification of models with a large structured state space where some variables do not directly depend on others. rather, the intensities are a function of the current values of a set of other variables, which also evolve as markov processes. we note that a similar model was used by lando (1998)[***]<2>, but the conditioning variables were not viewed as markov processes, nor were they built into any larger structured model, as in our framework. let y be a variable whose domain is val(y) = {y1 influence:3 type:2 pair index:371 citer id:373 citer title:continuous time bayesian networks citer abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph. more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events.
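the ctbn abstract repeated above describes each variable as a continuous time markov process whose transition intensities depend on the current values of its parents; the sketch below forward-samples a joint trajectory for a made-up two-variable, binary-state example using gillespie-style competing exponential clocks. the intensity matrices, variable names, and rates are illustrative assumptions, not taken from the paper.

import numpy as np

rng = np.random.default_rng(0)

# illustrative two-variable network: y has no parents; z's intensities
# depend on the current value of y. the matrices are made up.
Q_y = np.array([[-1.0, 1.0],
                [ 2.0, -2.0]])                  # intensity matrix for y
Q_z_given_y = {0: np.array([[-0.5, 0.5],
                            [ 0.5, -0.5]]),
               1: np.array([[-4.0, 4.0],
                            [ 1.0, -1.0]])}     # z switches faster when y = 1

def sample_trajectory(t_end, y0=0, z0=0):
    """forward-sample a joint trajectory: each variable waits an exponential
    time governed by its current intensity, given the parent's current value."""
    t, y, z = 0.0, y0, z0
    events = [(0.0, y, z)]
    while True:
        rate_y = -Q_y[y, y]
        rate_z = -Q_z_given_y[y][z, z]
        total = rate_y + rate_z
        t += rng.exponential(1.0 / total)       # time of the next transition
        if t >= t_end:
            break
        if rng.random() < rate_y / total:       # y transitions
            y = 1 - y                           # binary: flip to the other state
        else:                                   # z transitions
            z = 1 - z
        events.append((t, y, z))
    return events

print(sample_trajectory(5.0)[:5])

for variables with more than two states, the next state would instead be drawn in proportion to the off-diagonal intensities of the current row.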
we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process citee id:379 citee title:probabilistic reasoning in intelligent systems citee abstract:probabilistic reasoning in intelligent systems is a complete and accessible account of the theoretical foundations and computational methods that underlie plausible reasoning under uncertainty. the author provides a coherent explication of probability as a language for reasoning with partial belief and offers a unifying perspective on other ai approaches to uncertainty, such as the dempster-shafer formalism, truth maintenance systems, and nonmonotonic logic. the author distinguishes syntactic and semantic approaches to uncertainty—and offers techniques, based on belief networks, that provide a mechanism for making semantics-based systems operational. specifically, network-propagation techniques serve as a mechanism for combining the theoretical coherence of probability theory with modern demands of reasoning-systems technology: modular declarative inputs, conceptually meaningful inferences, and parallel distributed computation. application areas include diagnosis, forecasting, image interpretation, multi-sensor fusion, decision support systems, plan recognition, planning, speech recognition—in short, almost every task requiring that conclusions be drawn from uncertain clues and incomplete information. probabilistic reasoning in intelligent systems will be of special interest to scholars and researchers in ai, decision theory, statistics, logic, philosophy, cognitive psychology, and the management sciences. professionals in the areas of knowledge-based systems, operations research, engineering, and statistics will find theoretical and computational tools of immediate practical use. the book can also be used as an excellent text for graduate-level courses in ai, operations research, or applied probability. surrounding text:for example, the distribution over how fast a drug takes effect might be mediated by how fast it reaches the bloodstream which may itself be affected by how recently the person has eaten. bayesian networks (pearl, 1988)[***]<1> are a standard approach for modelling structured domains. with such a representation we can be explicit about the direct dependencies which are present and use the independencies to our advantage computationally; however, bayesian networks are designed to reason about static processes, and cannot be used directly to answer the types of questions that concern us here influence:2 type:1 pair index:372 citer id:373 citer title:continuous time bayesian networks citer abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph.
more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events. we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process citee id:380 citee title:probability propagation citee abstract:in this paper we give a simple account of local computation of marginal probabilities for when the joint probability distribution is given in factored form and the sets of variables involved in the factors form a hypertree. previous expositions of such local computation have emphasized conditional probability. we believe this emphasis is misplaced. what is essential to local computation is a factorization. it is not essential that this factorization be interpreted in terms of conditional probabilities. the account given here avoids the divisions required by conditional probabilities and generalizes readily to alternative measures of subjective probability, such as dempster-shafer or spohnian belief functions. surrounding text:5.1 the clique tree algorithm roughly speaking, the basic clique tree calibration step is almost identical to the propagation used in the shafer and shenoy (1990)[***]<2> algorithm, except that we use amalgamation as a substitute for products and approximate marginalization as a substitute for standard marginalization. initialization of the clique tree: we begin by constructing the clique tree for the graph g influence:3 type:3 pair index:373 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:382 citee title:vibes: a variational inference engine for bayesian networks citee abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone.
in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations. we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling surrounding text:see [13]<1> for a modern review of variational methods for statistical inference. in graphical models composed of conjugate-exponential family pairs and mixtures, the variational inference algorithm can be automatically derived from general principles [***, 14]<1>. in the ctm, however, the logistic normal is not conjugate to the multinomial influence:2 type:3 pair index:374 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:when fit from data, these distributions often correspond to intuitive notions of topicality. in this work, we build upon the latent dirichlet allocation (lda) [***]<1> model. lda assumes that the words of each document arise from a mixture of topics influence:1 type:1 pair index:375 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data.
the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:383 citee title:simplicial mixtures of markov chains: distributed modelling of dynamic user profiles citee abstract:to provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models. this paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of structurally simple common behavioral patterns which may interleave randomly in a user-specific proportion. the results of an empirical study on three different sources of user traces indicate that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing low-complexity and intuitively interpretable representations surrounding text:lda allows each document to exhibit multiple topics with different proportions, and it can thus capture the heterogeneity in grouped data that exhibit multiple latent patterns. recent work has used lda in more complicated document models [9, 11, 7]<3>, and in a variety of settings such as image processing [12]<3>, collaborative filtering [8]<3>, and the modeling of sequential data and user profiles [***]<3>. similar models were independently developed for disability survey data [5]<3> and population genetics [10]<3> influence:3 type:3 pair index:376 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial.
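to make the contrast between dirichlet and logistic-normal topic proportions concrete, the sketch below draws one bag of words either lda-style (dirichlet proportions) or ctm-style (softmax of a correlated gaussian); the topic matrix, covariance, and all parameter values are toy assumptions chosen only for illustration, not parameters from the cited papers.

import numpy as np

rng = np.random.default_rng(0)

def sample_document(beta, n_words, alpha=None, mu=None, sigma=None):
    """draw one bag of words from dirichlet proportions (lda-style) or,
    if mu/sigma are given, from logistic-normal proportions (ctm-style).
    beta: (k, v) topic-word distributions; values here are illustrative."""
    k, v = beta.shape
    if mu is None:
        theta = rng.dirichlet(alpha if alpha is not None else np.ones(k))
    else:
        eta = rng.multivariate_normal(mu, sigma)    # correlated gaussian
        theta = np.exp(eta) / np.exp(eta).sum()     # logistic (softmax) map
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta)                  # topic assignment
        words.append(rng.choice(v, p=beta[z]))      # word from that topic
    return words

# toy example: 3 topics over a 10-word vocabulary
beta = rng.dirichlet(np.ones(10), size=3)
mu = np.zeros(3)
sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])   # topics 0 and 1 tend to co-occur
print(sample_document(beta, 8, mu=mu, sigma=sigma))

with the covariance shown, topics 0 and 1 tend to receive large proportions in the same documents, which is the kind of positive correlation among topic proportions that a dirichlet cannot express.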
the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:327 citee title:collaborative filtering: a machine learning perspective citee abstract:collaborative filtering was initially proposed as a framework for filtering information based on the preferences of users, and has since been refined in many different ways. this thesis is a comprehensive study of rating-based, pure, non-sequential collaborative filtering. we analyze existing methods for the task of rating prediction from a machine learning perspective. we show that many existing methods proposed for this task are simple applications or modifications of one or more standard machine learning methods for classification, regression, clustering, dimensionality reduction, and density estimation. we introduce new prediction methods in all of these classes. we introduce a new experimental procedure for testing stronger forms of generalization than has been used previously.
we implement a total of nine prediction methods, and conduct large scale prediction accuracy experiments. we show interesting new results on the relative performance of these methods. surrounding text:lda allows each document to exhibit multiple topics with different proportions, and it can thus capture the heterogeneity in grouped data that exhibit multiple latent patterns. recent work has used lda in more complicated document models [9, 11, 7]<3>, and in a variety of settings such as image processing [12]<3>, collaborative filtering [***]<3>, and the modeling of sequential data and user profiles [6]<3>. similar models were independently developed for disability survey data [5]<3> and population genetics [10]<3> influence:3 type:3 pair index:378 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:385 citee title:inference of population structure using multilocus genotype data citee abstract:we describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. we assume a model in which there are k populations (where k may be unknown), each of which is characterized by a set of allele frequencies at each locus. individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. we show that the method can produce highly accurate assignments using modest numbers of loci—e.g., seven microsatellite loci in an example using genotype data from an endangered bird species. the software used for this article is available from http://www.stats.ox.ac.uk/zpritch/home.html. surrounding text:recent work has used lda in more complicated document models [9, 11, 7]<3>, and in a variety of settings such as image processing [12]<3>, collaborative filtering [8]<3>, and the modeling of sequential data and user profiles [6]<3>. similar models were independently developed for disability survey data [5]<3> and population genetics [***]<3>.
our goal in this paper is to address a limitation of the topic models proposed to date: they fail to directly model correlation between topics influence:3 type:3 pair index:379 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:386 citee title:discovering object categories in image collections citee abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift-like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of . 1 this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency surrounding text:lda allows each document to exhibit multiple topics with different proportions, and it can thus capture the heterogeneity in grouped data that exhibit multiple latent patterns. recent work has used lda in more complicated document models [9, ***, 7]<3>, and in a variety of settings such as image processing [12]<3>, collaborative filtering [8]<3>, and the modeling of sequential data and user profiles [6]<3>.
similar models were independently developed for disability survey data [5]<3> and population genetics [10]<3> influence:3 type:3 pair index:380 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:194 citee title:a variational principle for graphical models citee abstract:graphical models bring together graph theory and probability theory in a powerful formalism for multivariate statistical modeling. in statistical signal processing -- as well as in related fields such as communication theory, control theory and bioinformatics -- statistical models have long been formulated in terms of graphs, and algorithms for computing basic statistical quantities such as likelihoods and marginal probabilities have often been expressed in terms of recursions operating on... surrounding text:the tradeoff is that variational methods do not come with the same theoretical guarantees as simulation methods. see [***]<1> for a modern review of variational methods for statistical inference. in graphical models composed of conjugate-exponential family pairs and mixtures, the variational inference algorithm can be automatically derived from general principles [2, 14]<1>. in the ctm, however, the logistic normal is not conjugate to the multinomial influence:3 type:1 pair index:381 citer id:381 citer title:correlated topic models citer abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution. we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science.
furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets citee id:88 citee title:a generalized mean field algorithm for variational inference in exponential families citee abstract:we present a class of generalized mean field (gmf) algorithms for approximate inference in exponential family graphical models which is analogous to the generalized belief propagation (gbp) or cluster variational methods. while those methods are based on surrounding text:see [13]<1> for a modern review of variational methods for statistical inference. in graphical models composed of conjugate-exponential family pairs and mixtures, the variational inference algorithm can be automatically derived from general principles [2, ***]<1>. in the ctm, however, the logistic normal is not conjugate to the multinomial influence:3 type:1 pair index:382 citer id:382 citer title:vibes: a variational inference engine for bayesian networks citer abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone. in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations. we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling citee id:229 citee title:an introduction to variational methods for graphical models citee abstract:this paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (bayesian networks and markov random fields). we present a number of examples of graphical models, including the qmr-dt database, the sigmoid belief network, the boltzmann machine, and several variants of hidden markov models, in which it is infeasible to run exact inference algorithms. we then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. inference in the simplified model provides bounds on probabilities of interest in the original model. we describe a general framework for generating variational transformations based on convex duality. finally we return to the examples and demonstrate how variational algorithms can be formulated in each case. surrounding text:we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling. 1 introduction variational methods [***, 2]<1> have been used successfully for a wide range of models, and new applications are constantly being explored. in many ways the variational framework can be seen as a complementary approach to that of markov chain monte carlo (mcmc), with different strengths and weaknesses. finally, we have seen that continuous nodes can have both discrete and continuous parents but that discrete nodes can only have discrete parents.
we can allow discrete nodes to have continuous parents by stepping outside the conjugate-exponential framework and exploiting a variational bound on the logistic sigmoid function [***]<1>. we also wish to be able to evaluate the lower bound (2), both to confirm the correctness of the variational updates (since the value of the bound should never decrease), as well as to monitor convergence and set termination criteria influence:2 type:3 pair index:383 citer id:382 citer title:vibes: a variational inference engine for bayesian networks citer abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone. in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations. we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling citee id:146 citee title:a new view of the em algorithm that justifies incremental and other variants citee abstract:we present a new view of the em algorithm for maximum likelihood estimation in situations with unobserved variables. in this view, both the e and the m steps of the algorithm are seen as maximizing a joint function of the model parameters and of the distribution over unobserved variables. from this perspective, it is easy to justify an incremental variant of the algorithm in which the distribution for only one of the unobserved variables is recalculated in each e step. this variant is shown surrounding text:we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling. 1 introduction variational methods [1, ***]<1> have been used successfully for a wide range of models, and new applications are constantly being explored. in many ways the variational framework can be seen as a complementary approach to that of markov chain monte carlo (mcmc), with different strengths and weaknesses influence:2 type:3 pair index:384 citer id:382 citer title:vibes: a variational inference engine for bayesian networks citer abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone. in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations.
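the 'variational bound on the logistic sigmoid function' invoked above is usually taken to be the jaakkola-jordan bound $\sigma(x) \ge \sigma(\xi)\exp\{(x-\xi)/2 - \lambda(\xi)(x^2-\xi^2)\}$ with $\lambda(\xi) = (\sigma(\xi)-1/2)/(2\xi)$; assuming that is the bound meant here, the sketch below is a small numerical check of the inequality (our own illustration, not code from the vibes paper):

```python
# minimal numerical check of the jaakkola-jordan lower bound on the logistic sigmoid.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def jj_lower_bound(x, xi):
    # sigma(x) >= sigma(xi) * exp((x - xi)/2 - lambda(xi) * (x**2 - xi**2)),
    # with lambda(xi) = (sigma(xi) - 1/2) / (2*xi); the bound is tight at x = +/- xi.
    lam = (sigmoid(xi) - 0.5) / (2.0 * xi)
    return sigmoid(xi) * np.exp((x - xi) / 2.0 - lam * (x ** 2 - xi ** 2))

xs = np.linspace(-6, 6, 121)
for xi in (0.5, 1.0, 3.0):
    assert np.all(jj_lower_bound(xs, xi) <= sigmoid(xs) + 1e-12)  # bound never exceeds the sigmoid
```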
we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling citee id:868 citee title:winbugs - a bayesian modelling framework: concepts, structure and extensibility citee abstract:winbugs is a fully extensible modular framework for constructing and analysing bayesian full probability models. models may be specified either textually via the bugs language or pictorially using a graphical interface called doodlebugs. winbugs processes the model specification and constructs an object-oriented representation of the model. the software offers a user-interface, based on dialogue boxes and menu commands, through which the model may then be analysed using markov chain monte carlo techniques. in this paper we discuss how and why various modern computing concepts, such as object-orientation and run-time linking, feature in the software's design. we also discuss how the framework may be extended. it is possible to write specific applications that form an apparently seamless interface with winbugs for users with specialized requirements. it is also possible to interface with winbugs at a lower level by incorporating new object types that may be used by winbugs without knowledge of the modules in which they are implemented. neither of these types of extension requires access to, or even recompilation of, the winbugs source-code surrounding text:in many ways the variational framework can be seen as a complementary approach to that of markov chain monte carlo (mcmc), with different strengths and weaknesses. for many years there has existed a powerful tool for tackling new problems using mcmc, called bugs (`bayesian inference using gibbs sampling') [***]<1>. in bugs a new probabilistic model, expressed as a directed acyclic graph, can be encoded using a simple scripting notation, and then samples can be drawn from the posterior distribution (given some data set of observed values) using gibbs sampling in a way that is largely automatic influence:3 type:1 pair index:385 citer id:382 citer title:vibes: a variational inference engine for bayesian networks citer abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone. in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations.
we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling citee id:829 citee title:propagation algorithms for variational bayesian learning citee abstract:variational approximations are becoming a widespread tool for bayesian learning of graphical models. we provide some theoretical results for the variational updates in a very general family of conjugate-exponential graphical models. we show how the belief propagation and the junction tree algorithms can be used in the inference step of variational bayesian learning. applying these results to the bayesian analysis of linear-gaussian state-space models we obtain a learning procedure that surrounding text:2.1 factorized distributions for the purposes of building vibes we have focussed attention initially on distributions that factorize with respect to disjoint groups $x_i$ of variables $q(x|v) = \prod_i q_i(x_i)$ (4) this approximation has been successfully used in many applications of variational methods [***, 5, 6]<1>. substituting (4) into (2) we can maximize l(q) variationally with respect to $q_i(x_i)$ keeping all $q_j$ for $j \neq i$ fixed. 2.2 conjugate exponential models it has already been noted [***, 5]<1> that important simplifications to the variational update equations occur when the distributions of the latent variables, conditioned on their parameters, are drawn from the exponential family and are conjugate with respect to the prior distributions of the parameters. here we adopt a somewhat different viewpoint in that we make no distinction between latent variables and model parameters influence:2 type:2 pair index:386 citer id:382 citer title:vibes: a variational inference engine for bayesian networks citer abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone. in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations. we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling citee id:193 citee title:a variational bayesian framework for graphical models citee abstract:this paper presents a novel practical framework for bayesian model averaging and model selection in probabilistic graphical models. our approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner. these posteriors fall out of a free-form optimization procedure, which naturally incorporates conjugate priors. unlike in large sample approximations, the posteriors are generally non-gaussian and no hessian needs surrounding text:2.1 factorized distributions for the purposes of building vibes we have focussed attention initially on distributions that factorize with respect to disjoint groups $x_i$ of variables $q(x|v) = \prod_i q_i(x_i)$ (4) this approximation has been successfully used in many applications of variational methods [4, ***, 6]<1>.
substituting (4) into (2) we can maximize l(q) variationally with respect to $q_i(x_i)$ keeping all $q_j$ for $j \neq i$ fixed. 2.2 conjugate exponential models it has already been noted [4, ***]<1> that important simplifications to the variational update equations occur when the distributions of the latent variables, conditioned on their parameters, are drawn from the exponential family and are conjugate with respect to the prior distributions of the parameters. here we adopt a somewhat different viewpoint in that we make no distinction between latent variables and model parameters influence:2 type:2 pair index:387 citer id:382 citer title:vibes: a variational inference engine for bayesian networks citer abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone. in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations. we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling citee id:867 citee title:variational principal components citee abstract:one of the central issues in the use of principal component analysis (pca) for data modelling is that of choosing the appropriate number of retained components. this problem was recently addressed through the formulation of a bayesian treatment of pca (bishop, 1999a) in terms of a probabilistic latent variable model. a central feature of this approach is that the effective dimensionality of the latent space (equivalent to the number of retained principal components) is determined automatically as part of the bayesian inference procedure. in common with most non-trivial bayesian models, however, the required marginalizations are analytically intractable, and so an approximation scheme based on a local gaussian representation of the posterior distribution was employed. in this paper we develop an alternative, variational formulation of bayesian pca, based on a factorial representation of the posterior distribution. this approach is computationally efficient, and unlike other approximation schemes, it maximizes a rigorous lower bound on the marginal log probability of the observed data. surrounding text:2.1 factorized distributions for the purposes of building vibes we have focussed attention initially on distributions that factorize with respect to disjoint groups $x_i$ of variables $q(x|v) = \prod_i q_i(x_i)$ (4) this approximation has been successfully used in many applications of variational methods [4, 5, ***]<1>. substituting (4) into (2) we can maximize l(q) variationally with respect to $q_i(x_i)$ keeping all $q_j$ for $j \neq i$ fixed. 3.1 example: bayesian mixture models we illustrate vibes using a bayesian model for a mixture of m probabilistic pca distributions, each having maximum intrinsic dimensionality of q, with a sparse prior [***]<1>, for which the vibes implementation is shown in figure 2.
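the factorized scheme described above (maximize l(q) with respect to one factor $q_i$ while the other factors are held fixed) has the generic solution $\log q_i^*(x_i) = \mathbb{e}_{j \neq i}[\log p(x, v)] + \text{const}$; the sketch below runs that coordinate ascent on the textbook case of a factorized gaussian approximation to a correlated bivariate gaussian (our own illustration with an arbitrary mean and covariance, not the vibes implementation):

```python
# mean-field coordinate ascent for a correlated 2-d gaussian target (textbook example).
import numpy as np

mu = np.array([1.0, -1.0])                  # arbitrary target mean
Sigma = np.array([[1.0, 0.9], [0.9, 2.0]])  # arbitrary target covariance (correlated)
Lam = np.linalg.inv(Sigma)                  # precision matrix

# factorized approximation q(x1)q(x2); each factor is gaussian with fixed precision Lam[i, i].
m = np.zeros(2)  # variational means, initialized away from mu
for _ in range(50):
    # update q(x1) holding q(x2) fixed: log q1*(x1) = E_q2[log p(x1, x2)] + const
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    # update q(x2) holding q(x1) fixed
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

assert np.allclose(m, mu)  # the factorized means converge to the true mean
```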
here there are n observations of the vector t whose dimensionality is d, as indicated by the plates influence:2 type:2 pair index:388 citer id:382 citer title:vibes: a variational inference engine for bayesian networks citer abstract:in recent years variational methods have become a popular tool for approximate inference and learning in a wide variety of probabilistic models. for each new application, however, it is currently necessary first to derive the variational update equations, and then to implement them in application-specific code. each of these steps is both time consuming and error prone. in this paper we describe a general purpose inference engine called vibes (`variational inference for bayesian networks') which allows a wide variety of probabilistic models to be implemented and solved variationally without recourse to coding. new models are specified either through a simple script or via a graphical interface analogous to a drawing package. vibes then automatically generates and solves the variational equations. we illustrate the power and flexibility of vibes using examples from bayesian mixture modelling citee id:850 citee title:structured variational distributions in vibes citee abstract:variational methods are becoming increasingly popular for the approximate solution of complex probabilistic models in machine learning, computer vision, information retrieval and many other fields. unfortunately, for every new application it is necessary first to derive the specific forms of the variational update equations for the particular probabilistic model being used, and then to implement these equations in application-specific software. each of these steps is both time surrounding text:4 discussion our early experiences with vibes have shown that it dramatically simplifies the construction and testing of new variational models, and readily allows a range of alternative models to be evaluated on a given problem. currently we are extending vibes to cater for a broader range of variational distributions by allowing the user to specify a q distribution defined over a subgraph of the true graph [***]<1>. finally, there are many possible extensions to the basic vibes we have described. figure 3: as in figure 2 but with the vector of hyper-parameters moved outside the m `plate' influence:1 type:2 pair index:389 citer id:383 citer title:simplicial mixtures of markov chains: distributed modelling of dynamic user profiles citer abstract:to provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models. this paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of structurally simple common behavioral patterns which may interleave randomly in a user-specific proportion.
the results of an empirical study on three different sources of user traces indicates that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing low-complexity and intuitively interpretable representations citee id:428 citee title:unsupervised learning by probabilistic latent semantic analysis citee abstract:this paper presents a novel statistical method for factor analysis of binary and count data which is closely related to a technique known as latent semantic analysis. in contrast to the latter method which stems from linear algebra and performs a singular value decomposition of co-occurrence tables, the proposed technique uses a generative latent class model to perform a probabilistic mixture decomposition. this results in a more principled approach with a solid foundation in statistical inference. more precisely, we propose to make use of a temperature controlled version of the expectation maximization algorithm for model fitting, which has shown excellent performance in practice. probabilistic latent semantic analysis has many applications, most prominently in information retrieval, natural language processing, machine learning from text, and in related areas. the paper presents perplexity results for different types of text and linguistic data collections and discusses an application in automated document indexing. the experiments indicate substantial and consistent improvements of the probabilistic method over standard latent semantic analysis. surrounding text:the resulting model is thus a distributed dynamic model which benefits from the recent technical developments in distributed parts based modelling of static vectorial data [7, 9, ***, 4, 8, 2]<1>, with various applications including image decomposition, document modelling, information retrieval and collaborative filtering. ...fication of this model, which also highlights the close relationship between two existing related models, specifically the probabilistic latent semantic analysis (plsa) [***]<1> and lda [4]<1> as being instances of the same theoretical model and differing only in the estimation procedure adopted. 2. note that both (3) and (4) require an elementwise matrix multiplication and division so these iterations will scale linearly with the number of non-zero state-transition counts. it is interesting to note that the map estimator under a uniform dirichlet distribution exactly recovers the aspect mixture model of [***]<1> as a special case of the map estimated lda model. 2 influence:3 type:1 pair index:390 citer id:383 citer title:simplicial mixtures of markov chains: distributed modelling of dynamic user profiles citer abstract:to provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models. this paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of structurally simple common behavioral patterns which may interleave randomly in a user-specific proportion.
the results of an empirical study on three different sources of user traces indicates that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing low-complexity and intuitively interpretable representations citee id:691 citee title:maximum likelihood estimation of dirichlet distributions citee abstract:dirichlet distributions are commonly used as priors over proportional data. in this paper, i will introduce this distribution, discuss why it is useful, and compare implementations of 4 different methods for estimating its parameters from observed data surrounding text:...final parameter is that of the prior dirichlet distribution, maximum likelihood estimation yields map the estimated distribution parameters. given the n [***, 4]<1>. note that both (3) and (4) require an elementwise matrix multiplication and division so these iterations will scale linearly with the number of non-zero state-transition counts. given the variational parameters, the n are estimated using standard methods [***, 4]<1>. 2 influence:3 type:3 pair index:391 citer id:383 citer title:simplicial mixtures of markov chains: distributed modelling of dynamic user profiles citer abstract:to provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models. this paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of structurally simple common behavioral patterns which may interleave randomly in a user-specific proportion. the results of an empirical study on three different sources of user traces indicates that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing low-complexity and intuitively interpretable representations citee id:718 citee title:maximum likelihood estimation in the mover-stayer model citee abstract:the discrete time mover-stayer model, a special mixture of two independent markov chains, has been widely used in modeling the dynamics of social processes. the problem of maximum likelihood estimation of its parameters from the data, however, which consist of a sample of independent realizations of this process, has not been considered in the literature. i present a maximum likelihood procedure for the estimation of the parameters of the mover-stayer model and develop a recursive method of computation of maximum likelihood estimators that is very simple to implement. i also verify that obtained maximum likelihood estimators are strongly consistent. i show that the two estimators of the parameters of the mover-stayer model previously proposed in the literature are special cases of the maximum likelihood estimator derived in this article, that is, they coincide with the maximum likelihood estimator under special conditions. i thus explain the interconnection between existing estimators. i also present a numerical comparison of the three estimators. finally, i illustrate the application of the maximum likelihood estimators to testing the hypothesis that the markov chain describes the data against the hypothesis that the mover-stayer model describes the data.
surrounding text:to capture the possible heterogeneous nature of the observed sequences a model with a number of differing generating processes needs to be considered. indeed the notion of a heterogeneous population, characterized for example by occupational mobility and consumer brand preferences, has been captured in the mover-stayer model [***]<3>. this model is a discrete time stochastic process that is a two component mixture of influence:3 type:3 pair index:392 citer id:383 citer title:simplicial mixtures of markov chains: distributed modelling of dynamic user profiles citer abstract:to provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models. this paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of structurally simple common behavioral patterns which may interleave randomly in a user-specific proportion. the results of an empirical study on three different sources of user traces indicates that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing low-complexity and intuitively interpretable representations citee id:204 citee title:algorithms for non-negative matrix factorization citee abstract:non-negative matrix factorization (nmf) has previously been shown to be a useful decomposition for multivariate data. two different multiplicative algorithms for nmf are analyzed. they differ only slightly in the multiplicative factor used in the update rules. one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized kullback-leibler divergence. the monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the expectation-maximization algorithm. the algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence surrounding text:the resulting model is thus a distributed dynamic model which benefits from the recent technical developments in distributed parts based modelling of static vectorial data [***, 9, 1, 4, 8, 2]<1>, with various applications including image decomposition, document modelling, information retrieval and collaborative filtering. forming a lagrangian from the above to enforce the constraint that $\lambda^{map}$ is a sample point from a dirichlet variable, then taking derivatives with respect to the $\lambda_k^{map}$, a convergent series of updates $\lambda_{kn}^{t}$ is obtained, where the superscript denotes the t'th iteration. as in [***]<1>, for each observed sequence in the sample a map value for the variable is iteratively estimated by the following multiplicative updates $\lambda_{kn} = (\alpha_k \ldots$ influence:3 type:1 pair index:393 citer id:383 citer title:simplicial mixtures of markov chains: distributed modelling of dynamic user profiles citer abstract:to provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models.
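the citee here is the lee-seung nmf paper, and the surrounding text relies on multiplicative updates of exactly this flavour; the sketch below shows the standard multiplicative update rules for the generalized kullback-leibler objective with $v \approx wh$ (a generic illustration of the technique on random data, not the simplicial-mixture updates themselves):

```python
# lee-seung multiplicative updates for nmf under the (generalized) kl divergence, V ~ W @ H.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 30, 5
V = rng.random((n, m))          # non-negative data matrix (random for illustration)
W = rng.random((n, k)) + 1e-3   # non-negative factors, initialized randomly
H = rng.random((k, m)) + 1e-3

eps = 1e-12
for _ in range(200):
    # H <- H * (W^T (V / WH)) / (W^T 1);  W <- W * ((V / WH) H^T) / (1 H^T)
    WH = W @ H + eps
    H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
    WH = W @ H + eps
    W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)

# the kl objective sum(V*log(V/WH) - V + WH) is non-increasing under these updates
```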
this paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of structurally simple common behavioral patterns which may interleave randomly in a user-specific proportion. the results of an empirical study on three different sources of user traces indicates that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing low-complexity and intuitively interpretable representations citee id:534 citee title:expectation-propagation for the generative aspect model citee abstract:the generative aspect model is an extension of the multinomial model for text that allows word probabilities to vary stochastically across documents. surrounding text:the resulting model is thus a distributed dynamic model which benefits from the recent technical developments in distributed parts based modelling of static vectorial data [7, 9, 1, 4, ***, 2]<1>, with various applications including image decomposition, document modelling, information retrieval and collaborative filtering. consistent generative semantics similar to the recently introduced latent dirichlet allocation (lda) [4]<1> will be adopted and by analogy with [***]<1> the resulting model will be referred to as a simplicial mixture. 2 simplicial mixtures of markov chains assume that a sequence of l symbols $s_1 s_2 \cdots s_l$, denoted by s, can be drawn from a dictionary s by a process k which has initial state probability $p_1(s|k)$ and has $s^2$ state transition probabilities denoted by $t(s_i|s_j, k)$ where $i, j = 1, \ldots$. these are then combined with the individual state-transition probabilities $t_k$, which are model parameters to be estimated, and yield the symbol transition probabilities $t_{ij} = \sum_{k=1}^{K} t_{ijk} \lambda_k$. the overall probability for a sequence $s_n$ under such a mixture, which we shall now refer to as a simplicial mixture [***]<1>, denoted as $p(s_n | t, \alpha)$, is equal to $\int \prod_{i,j} \big( \sum_k t_{ijk} \lambda_k \big)^{r_{ij}^{n}} p(\lambda | \alpha) \, d\lambda$ influence:3 type:1 pair index:394 citer id:383 citer title:simplicial mixtures of markov chains: distributed modelling of dynamic user profiles citer abstract:to provide a compact generative representation of the sequential activity of a number of individuals within a group there is a tradeoff between the definition of individual specific and global models. this paper proposes a linear-time distributed model for finite state symbolic sequences representing traces of individual user activity by making the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of structurally simple common behavioral patterns which may interleave randomly in a user-specific proportion.
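for a fixed mixing vector $\lambda$ the integrand in the expression reconstructed above is just a markov likelihood under the blended transition matrix $t = \sum_k \lambda_k t_k$; the sketch below computes that log-likelihood for a symbol sequence (our own illustration with made-up matrices; blending the initial-state distribution the same way is an assumption rather than something stated in the excerpt):

```python
# log-likelihood of a symbol sequence under a blend of K markov chains with mixing weights lam.
import numpy as np

def simplicial_markov_loglik(seq, T_k, lam, p1_k):
    """seq: list of state indices; T_k: (K, S, S) per-chain transition matrices (rows sum to 1);
    lam: (K,) mixing weights on the simplex; p1_k: (K, S) per-chain initial-state probabilities."""
    T = np.einsum('k,kij->ij', lam, T_k)   # blended transitions t_ij = sum_k lam_k t_ijk
    p1 = lam @ p1_k                        # blended initial-state distribution (assumed)
    ll = np.log(p1[seq[0]])
    for a, b in zip(seq[:-1], seq[1:]):
        ll += np.log(T[a, b])
    return ll

# tiny made-up example: two chains over three symbols
T_k = np.array([[[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]],
                [[0.1, 0.45, 0.45], [0.45, 0.1, 0.45], [0.45, 0.45, 0.1]]])
p1_k = np.array([[1/3, 1/3, 1/3], [1/3, 1/3, 1/3]])
lam = np.array([0.7, 0.3])
print(simplicial_markov_loglik([0, 0, 1, 2, 2], T_k, lam, p1_k))
```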
the results of an empirical study on three different sources of user traces indicates that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as providing low-complexity and intuitively interpretable representations citee id:509 citee title:ensemble learning citee abstract:introduction: when we say we are making a model of a system, we are setting up a tool which can be used to make inferences, predictions and decisions. each model can be seen as a hypothesis, or explanation, which makes assertions about the quantities which are directly observable and which can only be inferred from their effect on observable quantities. in the bayesian framework, knowledge is contained in the conditional probability distributions of the models. we can use bayes' theorem to surrounding text:1 variational parameter estimation and inference while being optimal in analyzing an existing data set, map estimators are notoriously prone to overfitting, especially where there is a paucity of available data [***]<1> and so the variational bayes (vb) approach detailed in [4]<1> can be adopted. the above (2) can be further lower-bounded by noting that $e_{dir(\gamma)}[\log p(s_n|t, \lambda)] \ge \sum_{i=1}^{s}\sum_{j=1}^{s}\sum_{k=1}^{K} r_{ij}^{n} q_{ijk} \, e_{dir(\gamma)}\big[ \log \frac{\lambda_k t_{ijk}}{q_{ijk}} \big]$; employing this in (2), solving for $q_{ijk}$ and $\gamma_{kn}$ then combining yields the following multiplicative iterative update for the sequence speci influence:3 type:3 pair index:395 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:571 citee title:fractionating language: different neural subsystems with different sensitive periods citee abstract:theoretical considerations and psycholinguistic studies have alternately provided criticism and support for the proposal that semantic and grammatical functions are distinct subprocesses within the language domain. neurobiological evidence concerning this hypothesis was sought by (1) comparing, in normal adults, event-related brain potentials (erps) elicited by words that provide primarily semantic information (open class) and grammatical information (closed class) and (2) comparing the effects of the altered early language experience of congenitally deaf subjects on erps to open and closed class words. in normal-hearing adults, the different word types elicited qualitatively different erps that were compatible with the hypothesized different roles of the word classes in language processing. in addition, whereas erp indices of semantic processing were virtually identical in deaf and hearing subjects, those linked to grammatical processes were markedly different in deaf and hearing subjects. the results suggest that nonidentical neural systems with different developmental vulnerabilities mediate these different aspects of language. more generally, these results provide neurobiological support for the distinction between semantic and grammatical functions.
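the expectation $e_{dir(\gamma)}[\log \lambda_k]$ appearing in the bound reconstructed above has the standard closed form $\psi(\gamma_k) - \psi(\sum_j \gamma_j)$; a small monte-carlo check of that identity (our own illustration):

```python
# E_{Dirichlet(gamma)}[log lambda_k] = digamma(gamma_k) - digamma(sum(gamma)); monte-carlo check.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
gamma = np.array([2.0, 0.5, 3.0])
analytic = digamma(gamma) - digamma(gamma.sum())
samples = rng.dirichlet(gamma, size=200_000)
assert np.allclose(np.log(samples).mean(axis=0), analytic, atol=2e-2)
```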
surrounding text:1 introduction a word can appear in a sentence for two reasons: because it serves a syntactic function, or because it provides semantic content. words that play different roles are treated differently in human language processing: function and content words produce different patterns of brain activity [***]<3>, and have different developmental trends [2]<3>. so, how might a language learner discover the syntactic and semantic classes of words influence:3 type:3 pair index:396 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:445 citee title:distributional information: a powerful cue for acquiring syntactic categories citee abstract:many theorists have dismissed a priori the idea that distributional information could play a significant role in syntactic category acquisition. we demonstrate empirically that such information provides a powerful cue to syntactic category membership, which can be exploited by a variety of simple, psychologically plausible mechanisms. we present a range of results using a large corpus of child-directed speech and explore their psychological implications. while our results show that a considerable amount of information concerning the syntactic categories can be obtained from distributional information alone, we stress that many other sources of information may also be potential contributors to the identification of syntactic classes. surrounding text:so, how might a language learner discover the syntactic and semantic classes of words. cognitive scientists have shown that unsupervised statistical methods can be used to identify syntactic classes [***]<2> and to extract a representation of semantic content [4]<2>, but none of these methods captures the interaction between function and content words, or even recognizes that these roles are distinct. here we explore how statistical learning, with no prior knowledge of either syntax or semantics, can discover the difference between function and content words and simultaneously organize words into syntactic classes and semantic topics. this simplifies the problem, as most words belong to a single class. however, genuinely unsupervised recovery of parts-of-speech has been used to assess statistical models of language learning, such as distributional clustering [***]<2>. we assessed tagging performance on the brown corpus, using two tagsets. this is partly because all words that vary strongly in frequency across contexts get assigned to the semantic class in the composite model, so it misses some of the fine-grained distinctions expressed in the full tagset. both the hmm and the composite model performed better than the distributional clustering method described in [***]<2>, which was used to form the 1000 most frequent words in brown into 19 clusters. 
figure 6 compares this clustering with the classes for those words from the hmm and composite models trained on brown influence:1 type:2 pair index:397 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:178 citee title:a solution to plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge citee abstract:how do people know as much as they do with as little information as they get? the problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. a new general theory of acquired similarity and knowledge representation, latent semantic analysis (lsa), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. by inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, lsa acquired knowledge about the full vocabulary of english at a comparable rate to school-children. lsa uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. relations to other theories, phenomena, and problems are sketched. surrounding text:so, how might a language learner discover the syntactic and semantic classes of words. cognitive scientists have shown that unsupervised statistical methods can be used to identify syntactic classes [3]<2> and to extract a representation of semantic content [***]<2>, but none of these methods captures the interaction between function and content words, or even recognizes that these roles are distinct. here we explore how statistical learning, with no prior knowledge of either syntax or semantics, can discover the difference between function and content words and simultaneously organize words into syntactic classes and semantic topics influence:1 type:2 pair index:398 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:570 citee title:foundations of statistical natural language processing citee abstract:statistical approaches to processing natural language text have become dominant in recent years. this foundational text is the first comprehensive introduction to statistical natural language processing (nlp) to appear. 
the book contains all the theory and algorithms needed for building nlp tools. it provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. the book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications. surrounding text:g. , [***]<1>). probabilistic models of language are typically driven exclusively by either short-range or long-range dependencies between words. g. , [***]<1>) generate documents purely based on syntactic relations among unobserved word classes, while "bag-of-words" models like naive bayes or topic models (e.g influence:3 type:2 pair index:399 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:g. , [***]<1>) generate documents based on semantic correlations between words, independent of word order. by considering only one of the factors influencing the words that appear in documents, these approaches are forced to assess all words on a single criterion: an hmm will group nouns together, as they play the same syntactic role even though they vary across contexts, and a topic model will assign determiners to topics, even though they bear little semantic content. in addition to running the composite model with t = 200 and c = 20, we examined two special cases: t = 200, c = 2, being a model where the only hmm classes are the start/end and semantic classes, and thus equivalent to latent dirichlet allocation (lda; [***]<1>); and t = 1, c = 20, being an hmm in which the semantic class distribution does not vary across documents, and simply has a different hyperparameter from the other classes influence:3 type:1 pair index:400 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words.
we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:659 citee title:towards better integration of semantic predictors in statistical language modeling citee abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model surrounding text:g. [***]<2>), each word would exhibit both short-range and long-range dependencies. consideration of the structure of language reveals that neither of these models is appropriate influence:1 type:2 pair index:401 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:426 citee title:finding scientific topics citee abstract:a first step in identifying the content of a document is determining which topics that document addresses. we describe a generative model for documents, introduced by blei, ng, and jordan, in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. we then present a markov chain monte carlo algorithm for inference in this model. we use this algorithm to analyze abstracts from pnas by using bayesian model selection to establish the number of topics. we show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying hot topics by examining temporal dynamics and tagging abstracts to illustrate semantic content. surrounding text:however, em produces poor results with topic models, which have many parameters and many local maxima. consequently, recent work has focused on approximate inference algorithms [6, ***]<2>.
we will use markov chain monte carlo (mcmc influence:1 type:1 pair index:402 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:660 citee title:markov chain monte carlo in practice citee abstract:markov chain monte carlo (mcmc) methods make possible the use of flexible bayesian models that would otherwise be computationally infeasible. in recent years, a great variety of such applications have been described in the literature. applied statisticians who are new to these methods may have several questions and concerns, however: how much effort and expertise are needed to design and use a markov chain sampler? how much confidence can one have in the answers that mcmc produces? how does the use of mcmc affect the rest of the model-building process? at the joint statistical meetings in august, 1996, a panel of experienced mcmc users discussed these and other issues, as well as various tricks of the trade. this article is an edited recreation of that discussion. its purpose is to offer advice and guidance to novice users of mcmc and to not-so-novice users as well. topics include building confidence in simulation results, methods for speeding and assessing convergence, estimating standard errors, identification of models for which good mcmc algorithms exist, and the current state of software development. surrounding text:we will use markov chain monte carlo (mcmc; see [***]<1>) to perform full bayesian inference in this model, sampling from a posterior distribution over assignments of words to classes and topics. we assume that the document-specific distributions over topics, $\theta$, are drawn from a dirichlet($\alpha$) distribution, the topic distributions $\phi^{(z)}$ are drawn from a dirichlet($\beta$) distribution, the rows of the transition matrix for the hmm are drawn from a dirichlet($\gamma$) distribution, the class distributions $\phi^{(c)}$ are drawn from a dirichlet($\delta$) distribution, and all dirichlet distributions are symmetric. we use gibbs sampling to draw iteratively a topic assignment $z_i$ and class assignment $c_i$ for each word $w_i$ in the corpus (see [8, ***]<1>). given the words w, the class assignments c, the other topic assignments z influence:2 type:1 pair index:403 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words.
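the gibbs sweep described above draws each $z_i$ (and, in the composite model, each $c_i$) from its conditional given all other assignments; the sketch below implements the corresponding update for the topic part only, i.e. collapsed gibbs sampling for a plain topic model with symmetric priors, as a simplified stand-in for the composite model's conditionals (toy corpus and hyperparameter values are made up):

```python
# collapsed gibbs sampling for a plain topic model (lda) with symmetric priors alpha, beta.
# this is a simplified stand-in for the composite model's update, shown for the topic part only.
import numpy as np

rng = np.random.default_rng(0)
docs = [[0, 1, 2, 1, 0], [3, 4, 3, 5, 4], [0, 2, 4, 3, 1]]  # toy corpus: word ids per document
W, T, alpha, beta = 6, 2, 0.5, 0.1  # vocabulary size, topics, hyperparameters (arbitrary)

# count tables: document-topic, topic-word, topic totals
n_dt = np.zeros((len(docs), T))
n_tw = np.zeros((T, W))
n_t = np.zeros(T)
z = []  # current topic assignment for every token
for d, doc in enumerate(docs):
    z_d = []
    for w in doc:
        t = rng.integers(T)
        z_d.append(t)
        n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1
    z.append(z_d)

for _ in range(200):  # gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # remove the token's current assignment from the counts
            n_dt[d, t] -= 1; n_tw[t, w] -= 1; n_t[t] -= 1
            # p(z_i = t | z_-i, w) proportional to (n_dt + alpha) * (n_tw + beta) / (n_t + W*beta)
            p = (n_dt[d] + alpha) * (n_tw[:, w] + beta) / (n_t + W * beta)
            t = rng.choice(T, p=p / p.sum())
            z[d][i] = t
            n_dt[d, t] += 1; n_tw[t, w] += 1; n_t[t] += 1

print(n_tw)  # topic-word counts after sampling
```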
we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:273 citee title:bayes factors citee abstract:in a 1935 paper, and in his book theory of probability, jeffreys developed a methodology for quantifying the evidence in favor of a scientific theory. the centerpiece was a number, now called the bayes factor, which is the posterior odds of the null hypothesis when the prior probability on the null is one-half. although there has been much discussion of bayesian hypothesis testing in the context of criticism of p-values, less attention has been given to the bayes factor as a practical tool of surrounding text:3.3 marginal probabilities we assessed the marginal probability of the data under each model, p(w), using the harmonic mean of the likelihoods over the last 2000 iterations of sampling, a standard method for evaluating bayes factors via mcmc [***]<1>. this probability takes into account the complexity of the models, as more complex models are penalized by integrating over a latent space with larger regions of low probability influence:3 type:1 pair index:404 citer id:384 citer title:integrating topics and syntax citer abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively citee id:337 citee title:comparing partitions citee abstract:the problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. we begin by reviewing a well-known measure of partition correspondence often attributed to rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by morey and agresti (1984) and adopted by others (e.g., milligan and cooper 1985) is based on an incorrect assumption. then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. they are generated from corresponding partitions using various scoring rules. special cases derivable include traditionally familiar statistics and/or ones tailored to weight certain object pairs differentially. finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between 1 surrounding text:the other set collapsed these tags into ten high-level designations: adjective, adverb, conjunction, determiner, foreign, noun, preposition, pronoun, punctuation, and verb.
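the harmonic-mean estimate of p(w) used above is $\hat{p}(w) = [\frac{1}{s}\sum_s p(w \mid \text{sample } s)^{-1}]^{-1}$, i.e. in log space $\log \hat{p}(w) = \log s - \operatorname{logsumexp}(-\ell_s)$ over the per-iteration log-likelihoods $\ell_s$; a minimal sketch with made-up values (our own illustration):

```python
# harmonic-mean estimator of the marginal likelihood from per-sample log-likelihoods.
import numpy as np
from scipy.special import logsumexp

def log_harmonic_mean(loglik_samples):
    loglik_samples = np.asarray(loglik_samples, dtype=float)
    # log p(w) ~= log S - logsumexp(-loglik_s); numerically stable but known to be high-variance
    return np.log(len(loglik_samples)) - logsumexp(-loglik_samples)

print(log_harmonic_mean([-1052.3, -1050.1, -1049.8, -1051.0]))  # made-up log-likelihood values
```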
we evaluated tagging performance by using the adjusted rand index [***]<3>, to measure the concordance between the tags and the class assignments of the hmm and composite models in the 20th sample. the adjusted rand index ranges from influence:3 type:3 pair index:405 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift-like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of . this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:421 citee title:matching words and pictures citee abstract:we present a new and very rich approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. learning the joint distribution of image regions and words has many applications. we consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). auto-annotation might help organize and access large collections of images. region naming is a model of object surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the object's location and segmentation, as in face detection [17, 24]<2>, to providing only auxiliary data indicating the object's identity [***, 5, 7, 25]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen biases influence:2 type:2 pair index:406 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift-like region descriptors.
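the adjusted rand index used above is the chance-corrected form of the rand index and is available off the shelf; a minimal sketch comparing gold tags with induced class assignments (made-up label vectors, our own illustration):

```python
# chance-corrected agreement between gold tags and induced classes via the adjusted rand index.
from sklearn.metrics import adjusted_rand_score

gold_tags = ["noun", "verb", "noun", "det", "verb", "noun"]  # made-up gold labels
model_classes = [3, 1, 3, 0, 1, 0]                           # made-up class assignments
print(adjusted_rand_score(gold_tags, model_classes))  # 1.0 = identical partitions, ~0 = chance level
```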
we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:422 citee title:minimum complexity density estimation citee abstract:the authors introduce an index of resolvability that is proved to bound the rate of convergence of minimum complexity density estimators as well as the information-theoretic redundancy of the corresponding total description length. the results on the index of resolvability demonstrate the statistical effectiveness of the minimum description-length principle as a method of inference. the minimum complexity estimator converges to true density nearly as fast as an estimator based on prior knowledge of the true subclass of densities. interpretations and basic properties of minimum complexity estimators are discussed. some regression and classification problems that can be examined from the minimum description-length framework are considered surrounding text:even though the new categories all had substantially fewer im-ages (around 200), the results are still encouraging. discussion: in the experiments it was necessary to spec-ify the number of topics k, however bayesian [21]<2> or mini-mum complexity methods [***]<2> can be used to infer the num-ber of topics implied by a corpus. while designing these experiments, we grew to appre-ciate the many dif influence:3 type:3 pair index:407 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. 
performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hof-mann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al$ [***]<1>. both use the bag of words model, where positional relationships between features are ignored influence:1 type:1 pair index:408 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:423 citee title:visual catego-rization with bags of keypoints citee abstract:we present a novel method for generic visual categorization: the problem of identifying the object content of natural images while generalizing across variations inherent to the object class. this bag of keypoints method is based on vector quantization of affine invariant descriptors of image patches. 
we propose and compare two alternative implementations using different classifiers: na?ve bayes and svm. the main advantages of the method are that it is simple, computationally efficient and intrinsically invariant. we present results for simultaneously classifying seven semantic visual categories. these results clearly demonstrate that the method is robust to background clutter and produces good categorization accuracy even without exploiting geometric information. surrounding text:we ap-ply models used in statistical natural language processing to discover object categories and their image layout analo-gously to topic discovery in text. documents are images and we quantize local appearance descriptions to form visual words [***, 18, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hof-mann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al$ [3]<1>. others have used similar descriptors for object classi. cation [***, 15]<2>, but in a supervised setting. we compare the two statistical models with a control global texture model, similar to those proposed for pre-attentive vision [22]<2> and image retrieval [19]<2> influence:1 type:2 pair index:409 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:424 citee title:object recognition as machine translation: learning a lexicon for a fixed image vocabulary citee abstract:we describe a model of object recognition as machine translationfi surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the objects location and segmentation, as in face detec-tion [17, 24]<2>, to providing only auxiliary data indicating the objects identity [1, ***, 7, 25]<2>. 
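the visual words mentioned in the surrounding text are obtained by vector quantizing local region descriptors against a learned codebook; a minimal sketch of that step, assuming descriptors have already been extracted (the random descriptor arrays and the codebook size of 500 are illustrative assumptions, not the cited papers' settings):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# stand-in for 128-d SIFT-like descriptors pooled from the training images
train_descriptors = rng.random((5000, 128))

# learn a visual vocabulary by k-means clustering of the descriptors
codebook = KMeans(n_clusters=500, n_init=4, random_state=0).fit(train_descriptors)

def bag_of_visual_words(descriptors, codebook):
    """Quantize each descriptor to its nearest cluster centre ("visual word")
    and return the word-count histogram for one image."""
    words = codebook.predict(descriptors)
    return np.bincount(words, minlength=codebook.n_clusters)

image_descriptors = rng.random((300, 128))        # descriptors from one image
histogram = bag_of_visual_words(image_descriptors, codebook)
print(histogram.shape, histogram.sum())           # (500,) 300
```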
for a large dataset, any annotation is expensive, or may introduce unforeseen bi-ases influence:2 type:2 pair index:410 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:425 citee title:learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories citee abstract:current computational approaches to learning visual object categories require thousands of training images, are slow, cannot learn in an incremental manner and cannot incorporate prior information into the learning process. in addition, no algorithm presented in the literature has been tested on more than a handful of object categories. we present an method for learning object categories from just a few training images. it is quick and it uses prior information in a principled way. we test it on a dataset composed of images of objects belonging to 101 widely varied categories. our proposed method is based on making use of prior information, assembled from (unrelated) object categories which were previously learnt. a generative probabilistic model is used, which represents the shape and appearance of a constellation of features belonging to the object. the parameters of the model are learnt incrementally in a bayesian manner. our incremental algorithm is compared experimentally to an earlier batch bayesian algorithm, as well as to one based on maximum-likelihood. the incremental and batch versions have comparable classification performance on small training sets, but incremental learning is significantly faster, making real-time learning feasible. both bayesian methods outperform maximum likelihood on small training sets. surrounding text:cation), and two cate-gories ((7) and (8) below) from the more dif. cult 101 cate-gory dataset [***]<2>. 
label / description / # images: (1) all faces, 435; (1ub) faces on uniform background, 435 (a cropped version of (1)); (2) all motorbikes, 800; (2ub) motorbikes on uniform background, 349 (a subset of (2)); (3) all airplanes, 800; (3ub) airplanes on uniform background, 263 (a subset of (3)); (4) cars rear, 1155; (5) leopards, 200; (6) background, 1370; (7) watch, 241; (8) ketch, 114. the reason for picking these particular categories is pragmatic: they are the ones with the greatest number of images per category influence:1 type:2 pair index:411 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of . this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:426 citee title:finding scientific topics citee abstract:a first step in identifying the content of a document is determining which topics that document addresses. we describe a generative model for documents, introduced by blei, ng, and jordan, in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. we then present a markov chain monte carlo algorithm for inference in this model. we use this algorithm to analyze abstracts from pnas by using bayesian model selection to establish the number of topics. we show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying hot topics by examining temporal dynamics and tagging abstracts to illustrate semantic content. surrounding text:the goal is to maximize the following likelihood: p(w|α,β) = ∫∫ Σ_z p(w|z,φ) p(z|θ) p(θ|α) p(φ|β) dθ dφ (3), where θ and φ are multinomial parameters over the topics and words respectively and p(θ|α) and p(φ|β) are dirichlet distributions parameterized by the hyperparameters α and β. since the integral is intractable to solve directly, we solve for the parameters using gibbs sampling, as described in [***]<1>. the hyperparameters control the mixing of the multinomial weights (lower values give less mixing) and can prevent degeneracy.
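since the integral in the likelihood above is intractable, the parameters are estimated by sampling the topic assignments; a minimal sketch of a standard collapsed gibbs sweep for lda with scalar dirichlet hyperparameters alpha and beta (a toy corpus of word ids; this illustrates the generic update, not the cited implementation):

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.5, beta=0.5, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA. docs: list of lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(d)) for d in docs]      # topic assignment per token
    ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1      # remove this token's counts
                # p(z_i = k | rest) is proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1      # add the token back
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)   # p(z|d)
    phi = (nkw + beta) / (nkw.sum(1, keepdims=True) + V * beta)       # p(w|z)
    return theta, phi

docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]    # toy documents over V=5 words
theta, phi = gibbs_lda(docs, K=2, V=5)
print(theta.round(2))
```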
the hyperparameters control the mixing of the multinomial weights (lower values give less mixing) and can prevent degeneracy. as in [***]<1>, we specialize to scalar hyperparameters (e.g. influence:1 type:1 pair index:412 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of . this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:427 citee title:probabilistic latent semantic indexing citee abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous surrounding text:documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hofmann [***, 10]<1>, and the latent dirichlet allocation (lda) of blei et al. [3]<1>. both use the bag of words model, where positional relationships between features are ignored. the mixing coefficients p(z|d_test) are computed using the fold-in heuristic described in [***]<1>. in particular, the unseen image is projected on the simplex spanned by learned p(w|z), i.e. influence:1 type:1 pair index:413 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision.
we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of . this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:428 citee title:unsupervised learning by probabilistic latent semantic analysis citee abstract:this paper presents a novel statistical method for factor analysis of binary and count data which is closely related to a technique known as latent semantic analysis. in contrast to the latter method which stems from linear algebra and performs a singular value decomposition of co-occurrence tables, the proposed technique uses a generative latent class model to perform a probabilistic mixture decomposition. this results in a more principled approach with a solid foundation in statistical inference. more precisely, we propose to make use of a temperature controlled version of the expectation maximization algorithm for model fitting, which has shown excellent performance in practice. probabilistic latent semantic analysis has many applications, most prominently in information retrieval, natural language processing, machine learning from text, and in related areas. the paper presents perplexity results for different types of text and linguistic data collections and discusses an application in automated document indexing. the experiments indicate substantial and consistent improvements of the probabilistic method over standard latent semantic analysis. surrounding text:documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hofmann [9, ***]<1>, and the latent dirichlet allocation (lda) of blei et al. [3]<1>. both use the bag of words model, where positional relationships between features are ignored. the model is fitted using the expectation maximization (em) algorithm as described in [***]<1>. lda: in contrast to plsa, lda treats the multinomial weights over topics as latent random variables influence:1 type:1 pair index:414 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision.
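for plsa, the model is fitted with em, and an unseen image is folded in by freezing the learned p(w|z) and re-estimating only its mixing coefficients p(z|d_test); a minimal sketch on a toy count matrix (illustrative only, not the cited implementation):

```python
import numpy as np

def plsa_em(counts, K, n_iter=100, seed=0, fixed_p_w_z=None):
    """EM for pLSA on a (documents x words) count matrix. If fixed_p_w_z is given,
    p(w|z) is held fixed ("fold-in") and only p(z|d) is re-estimated."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)        # p(z|d)
    p_w_z = fixed_p_w_z
    if p_w_z is None:
        p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(1, keepdims=True)    # p(w|z)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w), shape (D, K, V)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / joint.sum(1, keepdims=True).clip(1e-12)
        weighted = counts[:, None, :] * resp                 # n(d,w) * p(z|d,w)
        # M-step
        if fixed_p_w_z is None:
            p_w_z = weighted.sum(0)
            p_w_z /= p_w_z.sum(1, keepdims=True).clip(1e-12)
        p_z_d = weighted.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True).clip(1e-12)
    return p_z_d, p_w_z

train = np.array([[4, 3, 0, 0], [5, 2, 1, 0], [0, 0, 3, 4], [0, 1, 2, 5]])
p_z_d, p_w_z = plsa_em(train, K=2)

# fold-in: project an unseen document onto the simplex spanned by the learned p(w|z)
test = np.array([[0, 1, 4, 3]])
p_z_dtest, _ = plsa_em(test, K=2, fixed_p_w_z=p_w_z)
print(p_z_dtest.round(2))
```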
we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of . this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:429 citee title:robust wide baseline stereo from maximally stable extremal regions citee abstract:the wide-baseline stereo problem, i.e. the problem of establishing correspondences between a pair of images taken from different viewpoints, is studied. a new set of image elements that are put into correspondence, the so-called extremal regions, is introduced. extremal regions possess highly desirable properties: the set is closed under (1) continuous transformation of image coordinates and (2) monotonic transformation of image intensities. an efficient (near linear complexity) and practically fast detection algorithm (near frame rate) is presented for an affinely invariant subset of extremal regions, the maximally stable extremal regions (mser). a new robust similarity measure for establishing tentative correspondences is proposed. the robustness ensures that invariants from multiple measurement regions (regions obtained by invariant constructions from extremal regions), some that are significantly larger (and hence discriminative) than the msers, may be used to establish tentative correspondences. the high utility of msers, multiple measurement regions and the robust metric is demonstrated in wide-baseline experiments on image pairs from both indoor and outdoor scenes. significant change of scale, illumination conditions, out-of-plane rotation, occlusion, locally anisotropic scale change and 3d translation of the viewpoint are all present in the test problems. good estimates of epipolar geometry (average distance from corresponding points to the epipolar line below 0) are obtained surrounding text:we use vector quantized sift descriptors [12]<1> computed on affine covariant regions [***, 14, 16]<1>. af. the second is constructed using the maximally stable procedure of matas et al. [***]<1> where areas are selected from an intensity watershed image segmentation. for both of these we use the binaries provided at [23]<1> influence:3 type:1 pair index:415 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision.
we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:430 citee title:weak hypotheses and boosting for generic object detection and recognition citee abstract:fi in this paper we describe the first stage of a new learning system for object detection and recognitionfi for our system we propose boosting as the underlying learning techniquefi this allows the use of very diverse sets of visual features in the learning process within a com-mon framework: boosting, together with a weak hypotheses finder, may choose very inhomogeneous features as most relevant for combina-tion into a final hypothesisfi as another advantage the weak hypotheses finder may search the weak hypotheses space without explicit calculation of all available hypotheses, reducing computation timefi this contrasts the related work of agarwal and roth where winnow was used as learning algorithm and all weak hypotheses were calculated explicitlyfi in our first empirical evaluation we use four types of local descriptors: two basic ones consisting of a set of grayvalues and intensity moments and two high level descriptors: moment invariants and sifts fi the descriptors are calculated from local patches detected by an inter-est point operatorfi the weak hypotheses finder selects one of the local patches and one type of local descriptor and e ciently searches for the most discriminative similarity thresholdfi this di ers from other work on boosting for object recognition where simple rectangular hypotheses or complex classifiers have been usedfi in relatively simple images, where the objects are prominent, our approach yields results comparable to the state-of-the-art fi but we also obtain very good results on more complex images, where the objects are located in arbitrary positions, poses, and scales in the imagesfi these results indicate that our flexible approach, which also allows the inclusion of features from segmented re-gions and even spatial relationships, leads us a significant step towards generic object recognitionfi surrounding text:others have used similar descriptors for object classi. cation [4, ***]<2>, but in a supervised setting. 
we compare the two statistical models with a control global texture model, similar to those proposed for pre-attentive vision [22]<2> and image retrieval [19]<2> influence:2 type:2 pair index:416 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:181 citee title:a statistical method for 3d object detection applied to faces and cars citee abstract:in this paper, we describe a statistical method for 3d object detectionfi we represent the statistics of both object appear-ance and "non-object" appearance using a product of histo-gramsfi each histogram represents the joint statistics of a subset of wavelet coefficients and their position on the objectfi our approach is to use many such histograms representing a wide variety of visual attributesfi using this method, we have devel-oped the first algorithm that can reliably detect human faces with out-of-plane rotation and the first algorithm that can reli-ably detect passenger cars over a wide range of viewpointsfi surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the objects location and segmentation, as in face detec-tion [***, 24]<2>, to providing only auxiliary data indicating the objects identity [1, 5, 7, 25]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen bi-ases influence:2 type:2 pair index:417 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. 
the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:431 citee title:video google: a text retrieval approach to object matching in videos citee abstract:we describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. the object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. the temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. the analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. the result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of google. the method is illustrated for matching on two full length feature films. surrounding text:we ap-ply models used in statistical natural language processing to discover object categories and their image layout analo-gously to topic discovery in text. documents are images and we quantize local appearance descriptions to form visual words [4, ***, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hof-mann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al$ [3]<1> influence:2 type:2 pair index:418 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. 
we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:242 citee title:automated image retrieval using color and texture citee abstract:the growing prevalence of digital images and videos is increasing the need for filters based on visual content. unfortunately, the appearance of effective tools for searching for and retrieving images and videos from large archives has not accompanied the proliferation of images and videos in digital form. in addition to the text-based indexes built upon human supplied annotations, digital image and video libraries require new algorithms for the automated extraction and indexing of relevant image features. in this paper we investigate a system for automated content extraction and database searching based upon color and texture features. these features are important because colors and textures are fundamental characteristics of the content of all images, giving this work general application towards databases of images and videos from a variety of domains. we present new algorithms for the automated extraction of color and texture information that use binary set representations of color and texture, respectively. we put special emphasis on the processes of feature extraction and indexing. we demonstrate that our binary feature sets for texture and color provide excellent performance in query response time while maintaining highly effective discriminability, accuracy in spatial localization and capability for extraction from compressed data representations. we present the color and texture set extraction and indexing techniques and contrast them to other approaches. we examine texture and color searching on databases of 500 and 3000 color images, respectively. finally, we examine the relationship between color and texture in application towards image and video retrieval. we explore the capability of combining the color and texture maps of images to obtain unified feature maps that characterize image regions by both color and texture. in particular, we indicate that by combining modalities we can better capture and index color patterns and characterize real world objects within images and videos. finally, we examine the nature and performance of image and video database queries that combine color and texture. surrounding text:cation [4, 15]<2>, but in a supervised setting. we compare the two statistical models with a control global texture model, similar to those proposed for pre-attentive vision [22]<2> and image retrieval [***]<2>. sect influence:2 type:2 pair index:419 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. 
the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:432 citee title:improved video content indexing by multiple latent semantic analysis citee abstract:low-level features are now becoming insufficient to build efficient content-based retrieval systems. users are not interested any longer in retrieving visually similar content, but they expect retrieval systems to also find documents with similar semantic content. bridging the gap between low-level features and semantic content is a challenging task necessary for future retrieval systems. latent semantic analysis (lsa) was successfully introduced to efficiently index text documents by detecting synonyms and the polysemy of words. we have successfully proposed an adaptation of lsa to model video content for object retrieval and semantic content estimation. following this idea we now present a new model composed of multiple lsas (m-lsa) to better represent the video content. in the experimental section, we make a comparison of lsa and m-lsa on two problems, namely object retrieval and semantic content estimation. surrounding text:we ap-ply models used in statistical natural language processing to discover object categories and their image layout analo-gously to topic discovery in text. documents are images and we quantize local appearance descriptions to form visual words [4, 18, ***, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hof-mann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al$ [3]<1>. for example, the topic vectors we have discovered may now be applied directly as semantic vectors for retrieval from image databases, and we anticipate signi. cant per-formance improvements compared to standard approaches such as lsa [***]<2>. references [1] k influence:3 type:2 pair index:420 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. 
we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:371 citee title:contextual priming for object detection citee abstract:there is general consensus that context can be a rich source of information about an object"s identity,location and scalefi in fact, the structure of many real-world scenes is governed by strong configurationalrules akin to those that apply to a single objectfi here we introduce a simple framework for modelingthe relationship between context and object properties based on the correlation between the statisticsof low-level features across the entire scene and the objects that it containsfi the surrounding text:cation [4, 15]<2>, but in a supervised setting. we compare the two statistical models with a control global texture model, similar to those proposed for pre-attentive vision [***]<2> and image retrieval [19]<2>. sect influence:2 type:2 pair index:421 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. 
performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:433 citee title:rapid object detection using a boosted cascade of simple features citee abstract:this paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection ratesfi this work is distinguished by three key contributionsfi the first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quicklyfi the second is a learning algorithm, based on adaboost, which selects a small number of critical visual features surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the objects location and segmentation, as in face detec-tion [17, ***]<2>, to providing only auxiliary data indicating the objects identity [1, 5, 7, 25]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen bi-ases influence:2 type:2 pair index:422 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:434 citee title:unsupervised learning of models for recognition citee abstract:fi we present a method to learn object class models from unlabeled andunsegmented cluttered scenes for the purpose of visual object recognitionfi wefocus on a particular type of model where objects are represented as flexible constellationsof rigid parts (features)fi the variability within a class is representedby a joint probability density function (pdf) on the shape of the constellation andthe output of part detectorsfi in a first stage, the method automatically identifiesdistinctive surrounding text:introduction common approaches to object recognition involve some form of supervision. 
this may range from specifying the objects location and segmentation, as in face detec-tion [17, 24]<2>, to providing only auxiliary data indicating the objects identity [1, 5, 7, ***]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen bi-ases influence:2 type:2 pair index:423 citer id:386 citer title:discovering object categories in image collections citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classi.cation of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semi-supervised approach of .1 1this work was sponsored in part by the eu project cogvisys, the university of oxford, shell oil, and the national geospatial-intelligence agency citee id:435 citee title:hidden semantic concept discovery in region based image retrieval citee abstract:this work addresses content based image retrieval (cbir), focusing on developing a hidden semantic concept discovery methodology to address effective semantics-intensive image retrieval. in our approach, each image in the database is segmented to region; associated with homogenous color, texture, and shape features. by exploiting regional statistical information in each image and employing a vector quantization method, a uniform and sparse region-based representation is achieved. with this representation a probabilistic model based on statistical-hidden-class assumptions of the image database is obtained, to which expectation-maximization (em) technique is applied to analyze semantic concepts hidden in the database. an elaborated retrieval algorithm is designed to support the probabilistic model. the semantic similarity is measured through integrating the posterior probabilities of the transformed query image, as well as a constructed negative example, to the discovered semantic concepts. the proposed approach has a solid statistical foundation and the experimental evaluations on a database of 10,000 general-purposed images demonstrate its promise of the effectiveness. surrounding text:we ap-ply models used in statistical natural language processing to discover object categories and their image layout analo-gously to topic discovery in text. documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, ***]<2>. 
the two models we investigate are the probabilistic latent semantic analysis (plsa) of hof-mann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al$ [3]<1> influence:2 type:2 pair index:424 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:398 citee title:show me the money! deriving the pricing power of product features by mining consumer reviews citee abstract:the increasing pervasiveness of the internet has dramatically changed the way that consumers shop for goods. consumergenerated product reviews have become a valuable source of information for customers, who read the reviews and decide whether to buy the product based on the information provided. in this paper, we use techniques that decompose the reviews into segments that evaluate the individual characteristics of a product (e.g., image quality and battery life for a digital camera). then, as a major contribution of this paper, we adapt methods from the econometrics literature, specifically the hedonic regression concept, to estimate: (a) the weight that customers place on each individual product feature, (b) the implicit evaluation score that customers assign to each feature, and (c) how these evaluations a?ect the revenue for a given product. towards this goal, we develop a novel hybrid technique combining text mining and econometrics that models consumer product reviews as elements in a tensor product of feature and evaluation spaces. we then impute the quantitative impact of consumer reviews on product demand as a linear functional from this tensor product space. we demonstrate how to use a lowdimension approximation of this functional to signicantly reduce the number of model parameters, while still providing good experimental results. we evaluate our technique using a data set from amazon.com consisting of sales data and the related consumer reviews posted over a 15-month period for 242 products. our experimental evaluation shows that we can extract actionable business intelligence from the data and better understand the customer preferences and actions. we also show that the textual portion of the reviews can improve product sales prediction compared to a baseline technique that simply relies on numeric data. 
surrounding text:hence, for each review, we have a "subjectivity" score for each of the sentences. we should note, though, that the numeric rating does not capture all the polarity information that appears in the review [***]<2>. influence:2 type:2 pair index:425 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that influence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:399 citee title:the effect of word of mouth on sales: online book reviews citee abstract:the creation of online consumer communities to provide product reviews and advice has been touted as an important, albeit somewhat expensive component of internet retail strategies. in this paper, we characterize reviewer behavior at two popular internet sites and examine the effect of consumer reviews on firms' sales. we use publicly available data from the two leading online booksellers, amazon.com and barnesandnoble.com, to construct measures of each firm's sales of individual books. we also gather extensive consumer review data at the two sites. first, we characterize the reviewer behavior on the two sites such as the distribution of the number of ratings and the valence and length of ratings, as well as ratings across different subject categories. second, we measure the effect of individual reviews on the relative shares of books across the two sites. we argue that our methodology of comparing the sales and reviews of a given book across internet retailers allows us to improve on the existing literature by better capturing a causal relationship between word of mouth (reviews) and sales since we are able to difference out factors that affect the sales and word of mouth of both retailers, such as the book's quality. we examine the incremental sales effects of having reviews for a particular book versus not having reviews and also the differential sales effects of positive and negative reviews. our large database of books also allows us to control for other important confounding factors such as differences across the sites in prices and shipping times surrounding text:with the rapid growth of the internet these conversations have migrated to online markets, creating active electronic communities that provide a wealth of product information.
consumers now rely on online product reviews, posted online by other consumers, for their purchase decisions [***]<3>. reviewers contribute time, energy, and other resources, enabling a social structure that provides benefits both for the users and the companies that host electronic markets influence:2 type:3 pair index:426 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that influence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:400 citee title:yahoo! for amazon: sentiment extraction from small talk on the web citee abstract:extracting sentiment from text is a hard semantic problem. we develop a methodology for extracting small investor sentiment from stock message boards. the algorithm comprises different classifier algorithms coupled together by a voting scheme. accuracy levels are similar to widely used bayes classifiers, but false positives are lower and sentiment accuracy higher. time series and cross-sectional aggregation of message information improves the quality of the resultant sentiment index, particularly in the presence of slang and ambiguity. empirical applications evidence a relationship with stock values: tech-sector postings are related to stock index levels, and to volumes and volatility. the algorithms may be used to assess the impact on investor opinion of management announcements, press releases, third-party news, and regulatory changes. surrounding text:our research papers aim to make a contribution by bridging these two streams of work. we also add to an emerging stream of literature that combines economic methods with text mining [***, 9, 16]<2>. for example, das and chen [***]<2> examined bulletin boards on yahoo.
finance to extract the sentiment of individual investors about tech companies and about the tech sector in general. they have shown that the aggregate tech sector sentiment predicts well the stock index movement, even though the sentiment cannot predict well the individual stock movements influence:2 type:2 pair index:427 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that influence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:143 citee title:a multi-level examination of the impact of social identities on economic transactions in electronic markets citee abstract:three of the most important uses of the internet today are as an economic marketplace, as a forum for social interaction, and as a source of information. in this paper, we explore how these three activities come together, in the form of emergent social communities built around information exchanges within it-enabled electronic marketplaces. drawing on social identity theory, we suggest that the relationship between online consumer reviews and internet product sales is partially explained by social identity processes. using a unique dataset based on both chronologically compiled ratings as well as reviewer characteristics for a given set of products and geographical location-based purchasing behavior from amazon, we provide evidence at the community level linking the prevalence of identity claiming behavior in an online community with subsequent product sales. in addition, we show that when reviewers claim to be from a particular geographic location, subsequent product sales are higher in that region. at the review level of analysis, we show that subsequent reviews conform to identity-claiming norms set in previous reviews, and that identity claiming that conforms to community norms elicits identity granting. furthermore, our results suggest that the prevalence of identity granting has implications for economic exchange in the form of product sales. implications for research on word-of-mouth and electronic communities are discussed. surrounding text:effort for ranking reviews for consumers comes in the form of peer reviewing in the review forums, where customers give helpful votes to other reviews.
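The "yahoo! for amazon" abstract above describes coupling several sentiment classifiers with a voting scheme before aggregating messages into a sentiment index. The sketch below is only a loose, simplified stand-in for that idea: the particular classifiers, bag-of-words features, toy messages, and the crude aggregate index are assumptions, not the cited algorithm.

    # Simplified sketch of a voting-scheme sentiment ensemble: several classifiers
    # vote on each message (hard majority vote), and the votes are averaged into a
    # rough aggregate sentiment index. Classifiers, features, and data are
    # illustrative only, not the ones used in the cited work.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import VotingClassifier
    from sklearn.pipeline import make_pipeline

    train_msgs = ["great earnings, buying more", "terrible guidance, selling",
                  "strong product launch", "weak demand and layoffs"]
    train_labels = [1, -1, 1, -1]            # 1 = bullish, -1 = bearish

    ensemble = make_pipeline(
        CountVectorizer(),
        VotingClassifier(
            estimators=[("nb", MultinomialNB()),
                        ("lr", LogisticRegression(max_iter=1000)),
                        ("svm", LinearSVC())],
            voting="hard"))                   # hard voting = majority rule
    ensemble.fit(train_msgs, train_labels)

    new_msgs = ["buying the dip, looks strong", "selling everything, weak quarter"]
    votes = ensemble.predict(new_msgs)
    sentiment_index = votes.sum() / len(votes)   # crude aggregate sentiment
    print(votes, sentiment_index)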
in digital markets, individuals use peer ratings to confirm that other reviewers are members in good standing within the community [***]<3>. unfortunately, the helpful votes are not a useful feature for ranking recent reviews: the helpful votes are accumulated over a long period of time, and hence cannot be used for review placement in a short- or medium-term time frame influence:2 type:3 pair index:428 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that influence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:402 citee title:opinion mining using econometrics: a case study on reputation systems citee abstract:deriving the polarity and strength of opinions is an important research topic, attracting significant attention over the last few years. in this work, to measure the strength and polarity of an opinion, we consider the economic context in which the opinion is evaluated, instead of using human annotators or linguistic resources. we rely on the fact that text in on-line systems influences the behavior of humans and this effect can be observed using some easy-to-measure economic variables, such as revenues or product prices. by reversing the logic, we infer the semantic orientation and strength of an opinion by tracing the changes in the associated economic variable. in effect, we use econometrics to identify the economic value of text and assign a dollar value to each opinion phrase, measuring sentiment effectively and without the need for manual labeling. we argue that by interpreting opinions using econometrics, we have the first objective, quantifiable, and context-sensitive evaluation of opinions. we make the discussion concrete by presenting results on the reputation system of amazon.com. we show that user feedback affects the pricing power of merchants and by measuring their pricing power we can infer the polarity and strength of the underlying feedback postings. surrounding text:our research papers aim to make a contribution by bridging these two streams of work. we also add to an emerging stream of literature that combines economic methods with text mining [5, ***, 16]<2>.
for example, das and chen [5]<2> examined bulletin boards on yahoo influence:2 type:2 pair index:429 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:167 citee title:the predictive power of online chatter citee abstract:an increasing fraction of the global discourse is migrating online in the form of blogs, bulletin boards, web pages, wikis, editorials, and a dizzying array of new collaborative technologies. the migration has now proceeded to the point that topics reflecting certain individual products are sufficiently popular to allow targeted online tracking of the ebb and flow of chatter around these topics. based on an analysis of around half a million sales rank values for 2,340 books over a period of four months, and correlating postings in blogs, media, and web pages, we are able to draw several interesting conclusions.first, carefully hand-crafted queries produce matching postings whose volume predicts sales ranks. second, these queries can be automatically generated in many cases. and third, even though sales rank motion might be difficult to predict in general, algorithmic predictors can use online postings to successfully predict spikes in sales rank. surrounding text:gruhl et al. [***]<2> analyzed the correlation between online mentions of a product and sales of that product. using sales rank information for more than 2,000 books from amazon influence:2 type:2 pair index:430 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. 
in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:ects product sales on a market such as amazon. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [***, 14, 15, 17, 21, 23]<2>, they have not examined the economic impact of the reviews. the rest of the paper is structured as follows. relatedwork our research program is inspired by previous studies about opinion strength analysis. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [***, 14, 15, 17, 21, 23]<2>, they have not examined the economic impact of the reviews. similarly, while prior work has looked at how the average rating of a review or social factors (such as self-descriptive information) is related to the proportion of helpful votes received by a review, it has not looked at how the textual sentiment of a review a influence:2 type:2 pair index:431 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. 
consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:111 citee title:determining the sentiment of opinions citee abstract:identifying sentiments (the affective parts of opinions) is a challenging problem. we present a system that, given a topic, automatically finds the people who hold opinions about that topic and the sentiment of each opinion. the system contains a module for determining word sentiment and another for combining sentiments within a sentence. we experiment with various models of classifying and combining sentiment at word and sentence levels, with promising results. surrounding text:ects product sales on a market such as amazon. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, ***, 15, 17, 21, 23]<2>, they have not examined the economic impact of the reviews. the rest of the paper is structured as follows. relatedwork our research program is inspired by previous studies about opinion strength analysis. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, ***, 15, 17, 21, 23]<2>, they have not examined the economic impact of the reviews. similarly, while prior work has looked at how the average rating of a review or social factors (such as self-descriptive information) is related to the proportion of helpful votes received by a review, it has not looked at how the textual sentiment of a review a influence:2 type:2 pair index:432 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. 
our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:404 citee title:market distortions when agents are better informed: the value of information in real estate transactions citee abstract:agents are often better informed than the clients who hire them and may exploit this informational advantage. real-estate agents, who know much more about the housing market than the typical homeowner, are one example. because real estate agents receive only a small share of the incremental profit when a house sells for a higher value, there is an incentive for them to convince their clients to sell their houses too cheaply and too quickly. we test these predictions by comparing home sales in which real estate agents are hired by others to sell a home to instances in which a real estate agent sells his or her own home. in the former case, the agent has distorted incentives; in the latter case, the agent wants to pursue the first-best. consistent with the theory, we find homes owned by real estate agents sell for about 3.7 percent more than other houses and stay on the market about 9.5 days longer, even after controlling for a wide range of housing characteristics. situations in which the agent's informational advantage is larger lead to even greater distortions. surrounding text:our research papers aim to make a contribution by bridging these two streams of work. we also add to an emerging stream of literature that combines economic methods with text mining [5, 9, ***]<2>. for example, das and chen [5]<2> examined bulletin boards on yahoo influence:3 type:2 pair index:433 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. 
this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly. surrounding text:ects product sales on a market such as amazon. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, 14, 15, ***, 21, 23]<2>, they have not examined the economic impact of the reviews. the rest of the paper is structured as follows. relatedwork our research program is inspired by previous studies about opinion strength analysis. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, 14, 15, ***, 21, 23]<2>, they have not examined the economic impact of the reviews. similarly, while prior work has looked at how the average rating of a review or social factors (such as self-descriptive information) is related to the proportion of helpful votes received by a review, it has not looked at how the textual sentiment of a review a influence:2 type:2 pair index:434 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. 
our results can have several implications for the market design of online opinion forums citee id:174 citee title:a sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts citee abstract:sentiment analysis seeks to identify the viewpoint(s) underlying a text span; an example application is classifying a movie review as "thumbs up" or "thumbs down". to determine this sentiment polarity, we propose a novel machine-learning method that applies text-categorization techniques to just the subjective portions of the document. extracting these portions can be implemented using efficient techniques for finding minimum cuts in graphs; this greatly facilitates incorporation of cross-sentence contextual constraints surrounding text:the other types of reviews are the reviews with "subjective," sentimental information, in which the reviewers give a very personal description of the product, and give information that typically does not appear in the official description of the product. as a first step towards understanding the impact of the textual content of the reviews on product sales, we rely on existing literature of subjectivity estimation from computational linguistics [***]<1>. specifically, pang and lee [***]<1> described a technique that identifies which sentences in a text convey objective information, and which of them contain subjective elements. pang and lee applied their techniques to a movie review data set, in which they considered as objective information the movie plot, and as subjective the information that appeared in the reviews influence:1 type:1 pair index:435 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that influence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufacturer-oriented ranking mechanism ranks the reviews according to their expected effect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales.
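The "sentimental education" abstract above describes extracting the subjective sentences of a review by solving a minimum cut over a graph of sentences. The toy sketch below shows one way such a formulation can be set up with networkx; the per-sentence subjectivity scores and the neighbour association weight are invented for illustration and are not the cited paper's values.

    # Toy illustration of a minimum-cut subjectivity extraction: each sentence is
    # a node with an individual subjectivity score, adjacent sentences are tied by
    # an association weight that encourages them to receive the same label, and a
    # source/sink min cut yields the subjective extract. Scores are made up.
    import networkx as nx

    sentences = ["the plot follows a retired detective.",   # objective-ish
                 "the pacing is absolutely wonderful.",      # subjective
                 "i loved every minute of it."]              # subjective
    subj_score = [0.2, 0.9, 0.8]     # hypothetical P(subjective) per sentence
    assoc = 0.5                       # hypothetical weight tying neighbours

    G = nx.DiGraph()
    for i, p in enumerate(subj_score):
        G.add_edge("src", i, capacity=p)          # cost of labelling i objective
        G.add_edge(i, "sink", capacity=1.0 - p)   # cost of labelling i subjective
    for i in range(len(sentences) - 1):           # smoothness between neighbours
        G.add_edge(i, i + 1, capacity=assoc)
        G.add_edge(i + 1, i, capacity=assoc)

    cut_value, (src_side, sink_side) = nx.minimum_cut(G, "src", "sink")
    subjective_extract = [sentences[i]
                          for i in sorted(n for n in src_side if n != "src")]
    print(subjective_extract)   # here: the two opinionated sentences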
our results can have several implications for the market design of online opinion forums citee id:405 citee title:sentiment classication using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:ects product sales on a market such as amazon. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, 14, 15, 17, ***, 23]<2>, they have not examined the economic impact of the reviews. the rest of the paper is structured as follows. relatedwork our research program is inspired by previous studies about opinion strength analysis. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, 14, 15, 17, ***, 23]<2>, they have not examined the economic impact of the reviews. similarly, while prior work has looked at how the average rating of a review or social factors (such as self-descriptive information) is related to the proportion of helpful votes received by a review, it has not looked at how the textual sentiment of a review a influence:2 type:2 pair index:436 citer id:397 citer title:designing novel review ranking systems citer abstract:with the rapid growth of the internet, users' ability to publish content has created active electronic communities that provide a wealth of product information. consumers naturally gravitate to reading reviews in order to decide whether to buy a product. however, the high volume of reviews that are typically published for a single product makes it harder for individuals to locate the best reviews and understand the true underlying quality of a product based on the reviews. similarly, the manufacturer of a product needs to identify the reviews that inuence the customer base, and examine the content of these reviews. in this paper, we propose two ranking mechanisms for ranking product reviews: a consumer-oriented ranking mechanism ranks the reviews according to their expected helpfulness, and a manufactureroriented ranking mechanism ranks the reviews according to their expected e?ect on sales. our ranking mechanism combines econometric analysis with text mining techniques in general, and with subjectivity analysis in particular. we show that subjectivity analysis can give useful clues about the helpfulness of a review and about its impact on sales. our results can have several implications for the market design of online opinion forums citee id:407 citee title:thumbs up or thumbs down? semantic orientation applied to unsupervised classication of reviews citee abstract:this paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). the classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. 
a phrase has a positive semantic orientation when it has good associations (e.g., subtle nuances) and a negative semantic orientation when it has bad associations (e.g., very cavalier). in this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word excellent minus the mutual information between the given phrase and the word poor. a review is classified as recommended if the average semantic orientation of its phrases is positive. the algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). the accuracy ranges from 84% for automobile reviews to 66% for movie reviews surrounding text:ects product sales on a market such as amazon. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, 14, 15, 17, 21, ***]<2>, they have not examined the economic impact of the reviews. the rest of the paper is structured as follows. relatedwork our research program is inspired by previous studies about opinion strength analysis. while prior work in computer science has extensively analyzed and classied sentiments in online opinions [12, 14, 15, 17, 21, ***]<2>, they have not examined the economic impact of the reviews. similarly, while prior work has looked at how the average rating of a review or social factors (such as self-descriptive information) is related to the proportion of helpful votes received by a review, it has not looked at how the textual sentiment of a review a influence:2 type:2 pair index:437 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. 
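The semantic-orientation formula quoted in the abstract above (pointwise mutual information with "excellent" minus pointwise mutual information with "poor") can be written out as a short calculation. In the cited method the counts come from web search hits with a NEAR operator; the counts below are invented solely to show the arithmetic.

    # Sketch of the semantic-orientation calculation described above:
    # SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor").
    # The document total and co-occurrence counts are hypothetical.
    import math

    N = 1_000_000            # hypothetical total number of documents
    hits = {                 # hypothetical occurrence / co-occurrence counts
        "excellent": 12_000, "poor": 9_000, "subtle nuances": 300,
        ("subtle nuances", "excellent"): 45, ("subtle nuances", "poor"): 6,
    }

    def pmi(phrase, word):
        p_joint = hits[(phrase, word)] / N
        p_phrase = hits[phrase] / N
        p_word = hits[word] / N
        return math.log2(p_joint / (p_phrase * p_word))

    def semantic_orientation(phrase):
        return pmi(phrase, "excellent") - pmi(phrase, "poor")

    print(round(semantic_orientation("subtle nuances"), 2))   # > 0 => positive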
our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:726 citee title:recognizing subjectivity: a case study of manual tagging citee abstract:in this paper, we describe a case study of a sentence-level categorization in which tagging instructions are developed and used by four judges to classify clauses from the wall street journal as either subjective or objective. agreement among the four judges is analyzed, and, based on that analysis, each clause is given a final classification. to provide empirical support for the classifications, correlations are assessed in the data between the subjective category and a basic semantic class surrounding text:clearly, this is related to existing work on distinguishing sentences used to express subjective opinions from sentences used to objectively describe some factual information [43]<2>. previous work on subjectivity [44, ***]<2> has established a positive statistically significant correlation with the presence of adjectives. thus the presence of adjectives is useful for predicting whether a sentence is subjective, i influence:2 type:2 pair index:438 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:333 citee title:combining low-level and summary representations of opinions for multi-perspective question answering citee abstract:while much recent progress has been made in research on fact-based question answering, our work aims to extend question-answering research in a different direction --- to handle multi-perspective question-answering tasks, i.e. question-answering tasks that require an ability to find and organize opinions in text. 
in particular, this paper proposes an approach to multi-perspective question answering that views the task as one of opinion-oriented information extraction. we first describe an surrounding text:, doesnt work, benchmark result and no problem(s)). in [***]<2>, cardie et al discuss opinion-oriented information extraction. they aim to create summary representations of opinions to perform question answering influence:3 type:2 pair index:439 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:727 citee title:word association norms, mutual information and lexicography citee abstract:the term word association is used in a very particular sense in the psycholinguistic literature. (generally speaking, subjects respond quicker than normal to the word nurse if it follows a highly associated word such as doctor. ) we will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between verbs and prepositions (content word/function word). this paper will propose an objective measure based on the information theoretic notion of mutual information, for estimating word association norms from computer readable corpora. (the standard method of obtaining word association norms, testing a few thousand subjects on a few hundred words, is both costly and unreliable.) the proposed measure, the association ratio, estimates word association norms directly from computer readable corpora, making it possible to estimate norms for tens of thousands of words surrounding text:2. 
4 terminology finding in terminology finding, there are basically two techniques for discovering terms in corpora: symbolic approaches that rely on syntactic description of terms, namely noun phrases, and statistical approaches that exploit the fact that the words composing a term tend to be found close to each other and reoccurring [21, 22, 7, ***]<2>. however, using noun phrases tends to produce too many non-terms (low precision), while using reoccurring phrases misses many low frequency terms, terms with variations, and terms with only one word influence:3 type:3 pair index:440 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:728 citee title:study and implementation of combined techniques for automatic extraction of terminology citee abstract:this paper presents an original method and its implementation to extract terminology from corpora by combining linguistic filters and statistical methods. starting from a linguistic study of the terms of telecommunication domain, we designed a number of filters which enable us to obtain a first selection of sequences that may be considered as terms. various statistical scores are applied to this selection and results are evaluated. this method has been applied to french and to english, but this surrounding text:2. 4 terminology finding in terminology finding, there are basically two techniques for discovering terms in corpora: symbolic approaches that rely on syntactic description of terms, namely noun phrases, and statistical approaches that exploit the fact that the words composing a term tend to be found close to each other and reoccurring [21, 22, ***, 6]<2>. 
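As a small illustration of the "symbolic" route to terminology finding mentioned above (describing terms syntactically as noun phrases), the sketch below chunks part-of-speech-tagged text with a simple noun-phrase grammar using NLTK; the chunk pattern and example sentence are illustrative assumptions, not the cited papers' actual patterns.

    # Minimal sketch of noun-phrase candidate-term extraction: tag a sentence with
    # parts of speech and chunk noun phrases with a simple grammar, so the noun
    # phrases become candidate (product-feature) terms.
    # Requires nltk plus its tokenizer and POS-tagger data packages.
    import nltk

    grammar = "NP: {<JJ>*<NN.*>+}"    # optional adjectives + one or more nouns
    chunker = nltk.RegexpParser(grammar)

    review = "The battery life is amazing but the zoom lens feels cheap."
    tagged = nltk.pos_tag(nltk.word_tokenize(review))

    candidate_terms = []
    for subtree in chunker.parse(tagged).subtrees(lambda t: t.label() == "NP"):
        candidate_terms.append(" ".join(word for word, tag in subtree.leaves()))

    print(candidate_terms)   # expected to include 'battery life' and 'zoom lens'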
however, using noun phrases tends to produce too many non-terms (low precision), while using reoccurring phrases misses many low frequency terms, terms with variations, and terms with only one word influence:3 type:3 pair index:441 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:2. related work our work is closely related to dave, lawrence and pennocks work in [***]<1> on semantic classification of reviews. 
using available training corpus from some web sites, where each review already has a class (e influence:1 type:2 pair index:442 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:631 citee title:wordnet: an electronic lexical database citee abstract:wordnet is perhaps the most important and widely used lexical resource for natural language processing systems up to now. wordnet: an electronic lexical database, edited by christiane fellbaum, discusses the design of wordnet from both theoretical and historical perspectives, provides an up-to-date description of the lexical database, and presents a set of applications of wordnet. the book contains a foreword by george miller, an introduction by christiane fellbaum, seven chapters from the cognitive sciences laboratory of princeton university, where wordnet was produced, and nine chapters contributed by scientists from elsewhere. surrounding text:, positive or negative. a bootstrapping technique is proposed to perform this task using wordnet [29, ***]<1>. finally, we decide the opinion orientation of each sentence influence:3 type:3 pair index:443 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. 
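The surrounding text above mentions a bootstrapping technique that assigns opinion orientations to adjectives using WordNet. The sketch below illustrates the general idea with NLTK's WordNet interface, propagating orientation from small seed sets through synonyms and antonyms; the seed words and the single expansion pass per set are simplifications for illustration, not the paper's exact procedure.

    # Simplified sketch of WordNet-based orientation bootstrapping: synonyms keep
    # the seed's orientation, antonyms get the opposite one. The cited procedure
    # iterates until no new words are added; here we do one pass per seed set.
    # Requires nltk with the 'wordnet' corpus downloaded.
    from nltk.corpus import wordnet as wn

    positive = {"good", "excellent", "amazing"}
    negative = {"bad", "poor", "terrible"}

    def expand(seed_same, seed_opposite):
        """One pass: add synonyms to seed_same, antonyms to seed_opposite."""
        same, opposite = set(seed_same), set(seed_opposite)
        for word in seed_same:
            for synset in wn.synsets(word):
                for lemma in synset.lemmas():
                    same.add(lemma.name().lower())
                    for ant in lemma.antonyms():
                        opposite.add(ant.name().lower())
        return same, opposite

    positive, negative = expand(positive, negative)
    negative, positive = expand(negative, positive)

    def orientation(word):
        if word in positive:
            return "positive"
        if word in negative:
            return "negative"
        return "unknown"

    print(len(positive), len(negative), orientation("splendid"))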
for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:584 citee title:genre classification and domain transfer for information filtering citee abstract:the world wide web is a vast repository of information, but the sheer volume makes it difficult to identify useful documents. we identify document genre as an important factor in retrieving useful documents and focus on the novel document genre dimension of subjectivity. we investigate three approaches to automatically classifying documents by genre: traditional bag of words techniques, part-of-speech statistics, and hand-crafted shallow linguistic features. we are particularly interested in domain transfer: how well the learned classifiers generalize from the training corpus to a new document corpus. our experiments demonstrate that the part-of-speech approach is better than traditional bag of words techniques, particularly in the domain transfer conditions. surrounding text:, editorial, novel, news, poem etc. although some techniques for genre classification can recognize documents that express opinions [23, 24, ***]<2>, they do not tell whether the opinions are positive or negative. in our work, we need to determine whether an opinion is positive or negative and to perform opinion classification at the sentence level rather than at the document level influence:2 type:2 pair index:444 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product.
this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:729 citee title:summarizing text documents: sentence selection and evaluation metrics citee abstract:human-quality text summarization systems are difficult to design, and even more difficult to evaluate, in part because documents can differ along several dimensions, such as length, writing style and lexical usage. nevertheless, certain cues can often help suggest the selection of sentences for inclusion in a summary. this paper presents our analysis of news-article summaries generated by sentence selection. sentences are ranked for potential inclusion in the summary using a weighted surrounding text:for a manufacturer, it is possible to combine summaries from multiple merchant sites to produce a single report for each of its products. our task is different from traditional text summarization [***, 39, 36]<2> in a number of ways. first of all, a summary in our case is structured rather than another (but shorter) free text document as produced by most text summarization systems influence:3 type:3 pair index:445 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. 
our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for texl processing applicalions such as information exlractiou and information iclricwd. we study tile elfeels of dymunic adjectives, semantically oriented adjectives, and gradable ad.ieclivcs on a simple subjectivity classiiicr, and establish lhat lhcy arc strong predictors of subjectivity. a novel trainable mclhod thai statistically combines two indicators of gradability is presented and ewlhlalcd, complementing exisling automatic icchniques for assigning orientation labels. surrounding text:in our work, we need to determine whether an opinion is positive or negative and to perform opinion classification at the sentence level rather than at the document level. a more closely related work is [***]<2>, in which the authors investigate sentence subjectivity classification and concludes that the presence and type of adjectives in a sentence is indicative of whether the sentence is subjective or objective. however, their work does not address our specific task of determining the semantic orientations of those subjective sentences. g. , external, digital) [***]<2>. in this work, we are interested in only positive and negative orientations influence:1 type:2 pair index:446 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. 
our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:417 citee title:direction-based text interpretation as an information access refinement citee abstract:a text-based intelligent system should provide more in-depth information about the contents of its corpus than does a standard information retrieval system, while at the same time avoiding the complexity and resource-consuming behavior of detailed text understanders. instead of focusing on discovering documents that pertain to some topic of interest to the user, an approach is introduced based on the criterion of directionality (e.g., is the agent in favor of, neutral, or opposed to the event?). a method is described for coercing sentence meanings into a metaphoric model such that the only semantic interpretation needed in order to determine the directionality of a sentence is done with respect to the model. this interpretation method is designed to be an integrated component of a hybrid information access system. surrounding text:2. 2 sentiment classification works of hearst [***]<2> and sack [35]<2> on sentiment-based classification of entire documents use models inspired by cognitive linguistics. das and chen [8]<2> use a manually crafted lexicon in conjunction with several scoring methods to classify stock postings on an investor bulletin influence:2 type:2 pair index:447 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:540 citee title:mining opinion features in customer reviews citee abstract:it is a common practice that merchants selling products on the web ask their customers to review the products and associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. 
for a popular product, the number of reviews can be in hundreds. this makes it difficult for a potential customer to read them in order to make a decision on whether to buy the product. in this project, we aim to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we are only interested in the specific features of the product that customers have opinions on and also whether the opinions are positive or negative. we do not summarize the reviews by selecting or rewriting a subset of the original sentences from the reviews to capture their main points as in the classic text summarization. in this paper, we only focus on mining opinion/product features that the reviewers have commented on. a number of techniques are presented to mine such features. our experimental results show that these techniques are highly effective. surrounding text:we make use of both data mining and natural language processing techniques to perform this task. this part of the study has been reported in [***]<0>, however, for completeness, we will summarize its techniques in this paper and also present a comparative evaluation. (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative. compactness pruning aims to prune those candidate features whose words do not appear together in a specific order. see [***]<0> for the detailed definition of compactness and also the pruning procedure. redundancy pruning: in this step, we focus on removing redundant features that contain single words. for instance, life by itself is not a useful feature while battery life is a meaningful feature phrase. see [***]<0> for more explanations. 3 influence:2 type:2 pair index:448 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. 
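The compactness and redundancy pruning steps summarized in the entry above are only referenced, not defined, here. As a rough illustration of the general idea (not the exact procedure of the cited feature-mining work), the sketch below uses an assumed word-distance window of 3, an assumed minimum of 2 compact sentences, and an assumed minimum p-support of 3; consult the cited work for the real definitions and thresholds.

```python
# Illustrative sketch only (thresholds and definitions are assumptions, not the
# cited procedure): prune multi-word candidate features that are not "compact"
# in enough sentences, and drop single-word features that mostly occur inside
# longer kept feature phrases.
MAX_GAP = 3        # assumed max distance between adjacent feature words in a sentence
MIN_COMPACT = 2    # assumed min number of sentences in which a phrase must be compact
MIN_PSUPPORT = 3   # assumed min "pure" support for keeping a single-word feature

def is_compact(phrase_words, sentence_words):
    """True if the phrase words occur in order in the sentence with small gaps."""
    positions, start = [], 0
    for w in phrase_words:
        try:
            idx = sentence_words.index(w, start)
        except ValueError:
            return False
        positions.append(idx)
        start = idx + 1
    return all(b - a <= MAX_GAP for a, b in zip(positions, positions[1:]))

def prune(candidates, sentences):
    """candidates: feature phrases as tuples of words; sentences: tokenized sentences."""
    # compactness pruning for multi-word candidates
    kept = []
    for cand in candidates:
        if len(cand) == 1 or sum(is_compact(list(cand), s) for s in sentences) >= MIN_COMPACT:
            kept.append(cand)
    # redundancy pruning for single-word candidates: count sentences where the word
    # appears without any kept superset phrase (its "pure" support)
    final = []
    for cand in kept:
        if len(cand) > 1:
            final.append(cand)
            continue
        supersets = [c for c in kept if len(c) > 1 and cand[0] in c]
        p_support = sum(
            cand[0] in s and not any(all(w in s for w in sup) for sup in supersets)
            for s in sentences
        )
        if p_support >= MIN_PSUPPORT or not supersets:
            final.append(cand)
    return final

# toy usage (assumed data): "battery life" survives, lone "battery"/"life" are pruned
sents = [["the", "battery", "life", "is", "great"],
         ["battery", "life", "could", "be", "better"],
         ["the", "battery", "died"],
         ["long", "battery", "life"]]
print(prune([("battery", "life"), ("battery",), ("life",)], sents))
```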
our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:575 citee title:fuzzy typing for document management citee abstract:this prototype system demonstrates a novel method of document analysis and management, based on a combination of techniques from nlp and fuzzy logic. since the central technique we use from nlp is semantic typing, we refer to this approach as fuzzy typing for document management surrounding text:das and chen [8]<2> use a manually crafted lexicon in conjunction with several scoring methods to classify stock postings on an investor bulletin. huettner and subasic [***]<2> also manually construct a discriminant-word lexicon and use fuzzy logic to classify sentiments. tong [41]<2> generates sentiment timelines influence:2 type:3 pair index:449 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:730 citee title:term extraction and automatic indexing citee abstract:this chapter presents a new domain of research and development in natural language processing (nlp) that is concerned with the representation, acquisition, and recognition of terms. terms are pervasive in scientific and technical documents; their identification is a crucial issue for any application dealing with the analysis, understanding, generation, or translation of such documents. in particular, the ever-growing mass of specialized documentation available on-line, in industrial and surrounding text:2. 
4 terminology finding in terminology finding, there are basically two techniques for discovering terms in corpora: symbolic approaches that rely on syntactic description of terms, namely noun phrases, and statistical approaches that exploit the fact that the words composing a term tend to be found close to each other and reoccurring [***, 22, 7, 6]<2>. however, using noun phrases tends to produce too many non-terms (low precision), while using reoccurring phrases misses many low frequency terms, terms with variations, and terms with only one word influence:3 type:3 pair index:450 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:731 citee title:recognizing text genres with simple metrics using discriminant analysis citee abstract:a simple method for categorizing texts into pre-determined text genre categories using the statistical standard technique of discriminant analysis is demonstrated with application to the brown corpus. discriminant analysis makes it possible use a large number of parameters that may be specific for a certain corpus or information stream, and combine them into a small number of functions, with the parameters weighted on basis of how useful they are for discriminating text genres. an application to information retrieval is discussed surrounding text:, editorial, novel, news, poem etc. although some techniques for genre classification can recognize documents that express opinions [***, 24, 14]<2>, they do not tell whether the opinions are positive or negative. 
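The symbolic route to term finding mentioned in the entry above treats terms as noun phrases. A minimal sketch of that idea, assuming NLTK and a simple POS-tag chunk pattern (neither is the tooling used in the cited works), is shown below.

```python
# Minimal noun-phrase term extraction via part-of-speech patterns.
# NLTK and the chunk grammar are illustrative assumptions, not the cited systems.
# Requires the NLTK data packages 'punkt' and 'averaged_perceptron_tagger'.
import nltk

GRAMMAR = "NP: {<JJ>*<NN.*>+}"   # optional adjectives followed by one or more nouns
chunker = nltk.RegexpParser(GRAMMAR)

def noun_phrases(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = chunker.parse(tagged)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "NP"
    ]

print(noun_phrases("The battery life of this camera is excellent."))
# e.g. ['battery life', 'camera'], depending on the tagger
```

The precision/recall trade-off noted above shows up directly: the pattern happily accepts many non-terms, while low-frequency, one-word, or variant terms depend entirely on the tagger and on the chosen pattern.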
in our work, we need to determine whether an opinion is positive or negative and to perform opinion classification at the sentence level rather than at the document level influence:2 type:2 pair index:451 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:266 citee title:automatic detection of text genre citee abstract:as the text databases available to users become larger and more heterogeneous, genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification. we propose a theory of genres as bundles of facets, which correlate with various surface cues, and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties surrounding text:, editorial, novel, news, poem etc. although some techniques for genre classification can recognize documents that express opinions [23, ***, 14]<2>, they do not tell whether the opinions are positive or negative. in our work, we need to determine whether an opinion is positive or negative and to perform opinion classification at the sentence level rather than at the document level influence:2 type:2 pair index:452 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. 
it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:658 citee title:integrating classification and association rule mining citee abstract:concept lattice is an efficient tool for data analysis. in this paper we show how classification and association rule mining can be unified under concept lattice framework. we present a fast algorithm to extract association and classification rules from concept lattice surrounding text:those noun/noun phrases that are infrequent are likely to be non-product features. we run the association miner cba [***]<1>, which is based on the apriori algorithm in [1]<3> on the transaction set of noun/noun phrases produced in the previous step. each resulting frequent itemset is a possible feature influence:1 type:1 pair index:453 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. 
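The frequent-feature step described in the entry above feeds per-sentence transactions of nouns/noun phrases to an association miner. The sketch below is a minimal Apriori-style itemset counter in that spirit; it is not the CBA miner itself, and the support threshold and toy transactions are assumptions.

```python
# Minimal Apriori-style frequent itemset sketch over noun-phrase "transactions"
# (one transaction per sentence). Illustration only: not the CBA association
# miner used in the paper, and the support threshold is assumed.
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.01, max_size=3):
    n = len(transactions)
    tx_sets = [set(t) for t in transactions]
    # frequent 1-itemsets
    counts = {}
    for t in tx_sets:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    current = {s for s, c in counts.items() if c / n >= min_support}
    all_frequent = {s: counts[s] / n for s in current}
    size = 1
    while current and size < max_size:
        size += 1
        # candidate generation: keep only candidates whose (k-1)-subsets are all frequent
        items = set().union(*current)
        candidates = {frozenset(c) for c in combinations(sorted(items), size)
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        counts = {c: sum(c <= t for t in tx_sets) for c in candidates}
        current = {c for c, cnt in counts.items() if cnt / n >= min_support}
        all_frequent.update((c, counts[c] / n) for c in current)
    return all_frequent

# toy transactions: nouns/noun phrases extracted from review sentences (assumed data)
sentences = [
    {"picture quality", "battery life"},
    {"picture quality", "zoom"},
    {"battery life"},
    {"picture quality"},
]
print(frequent_itemsets(sentences, min_support=0.5))
```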
this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:732 citee title:multi-document summarization by graph search and matching citee abstract:we describe a new method for summarizing similarities and differences in a pair of related documents using a graph representation for text. concepts denoted by words, phrases, and proper names in the document are represented positionally as nodes in the graph along with edges corresponding to semantic relations between items. given a perspective in terms of which the pair of documents is to be summarized, the algorithm first uses a spreading activation technique to discover, in each document, nodes semantically related to the topic. the activated graphs of each document are then matched to yield a graph corresponding to similarities and differences between the pair, which is rendered in natural language. an evaluation of these techniques has been carried out. surrounding text:some researchers also studied summarization of multiple documents covering similar information. their main purpose is to summarize the similarities and differences in the information content among these documents [***]<2>. our work is related but quite different because we aim to find the key features that are talked about in multiple reviews influence:3 type:3 pair index:454 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:570 citee title:foundations of statistical natural language processing citee abstract:statistical approaches to processing natural language text have become dominant in recent years. this foundational text is the first comprehensive introduction to statistical natural language processing (nlp) to appear. 
the book contains all the theory and algorithms needed for building nlp tools. it provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. the book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications. surrounding text:in the last two steps, the orientation of each opinion sentence is identified and a final summary is produced. note that pos tagging is the part-of-speech tagging [***]<1> from natural language processing, which helps us to find opinion features. below, we discuss each of the sub-steps in turn influence:2 type:1 pair index:455 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:663 citee title:introduction to wordnet: an on-line lexical database citee abstract:wordnet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. english nouns, verbs, and adjectives are organized into synonym sets, each representing one underlying lexical concept. different relations link the synonym sets surrounding text:, positive or negative. a bootstrapping technique is proposed to perform this task using wordnet [***, 12]<1>. finally, we decide the opinion orientation of each sentence influence:3 type:3 pair index:456 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. 
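The orientation step mentioned in the entry above bootstraps adjective polarity from WordNet. A rough sketch of that style of propagation through synonym/antonym links, assuming NLTK's WordNet interface, a tiny seed list, and a fixed search depth (none of which are taken from the paper), is:

```python
# Sketch of seed-based orientation bootstrapping over WordNet synonym/antonym links.
# NLTK's WordNet interface, the seed list, and the depth limit are assumptions.
from nltk.corpus import wordnet as wn   # requires the 'wordnet' NLTK data package

orientation = {"good": +1, "great": +1, "nice": +1,
               "bad": -1, "poor": -1, "terrible": -1}   # assumed seed adjectives

def predict_orientation(adjective, max_depth=3):
    """Return +1/-1 if a synonym/antonym path to a seed adjective is found, else None."""
    frontier = [(adjective, +1)]          # (word, sign relative to the query word)
    seen = set()
    for _ in range(max_depth):
        next_frontier = []
        for word, sign in frontier:
            if word in seen:
                continue
            seen.add(word)
            if word in orientation:
                return sign * orientation[word]
            for syn in wn.synsets(word):
                if syn.pos() not in ("a", "s"):   # keep adjectives and satellites only
                    continue
                for lemma in syn.lemmas():
                    next_frontier.append((lemma.name(), sign))
                    for ant in lemma.antonyms():
                        next_frontier.append((ant.name(), -sign))
        frontier = next_frontier
    return None

print(predict_orientation("amazing"))   # +1 if a synonym path to a seed exists, else None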
for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:pang et al. [***]<2> examine several supervised machine learning methods for sentiment classification of movie reviews and conclude that machine learning techniques outperform the method that is based on human-tagged features although none of existing methods could handle the sentiment classification with a reasonable accuracy. our work differs from these works on sentiment classification in that we perform classification at the sentence level while they determine the sentiment of each document influence:1 type:2 pair index:457 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. 
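For reference, the document-level machine-learning classifiers discussed in the entry above can be approximated by a simple bag-of-words baseline; scikit-learn, the unigram/bigram features, and the miniature corpus below are illustrative assumptions, not the cited study's setup.

```python
# Toy document-level sentiment baseline (bag-of-words + naive Bayes).
# scikit-learn and the miniature corpus are illustrative assumptions only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "a wonderful, moving film with great performances",
    "crisp photos and excellent battery life",
    "a dull, predictable plot and terrible acting",
    "poor build quality and the battery died quickly",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["great picture quality but poor battery"]))
```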
in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:271 citee title:automatic text decomposition using text segments and text themes citee abstract:with the widespread use of full-text information retrieval, passage-retrieval techniques are becoming increasingly popular. larger texts can then be replaced by important text excerpts, thereby simplifying the retrieval task and improving retrieval effectiveness. passagelevel evidence about the use of words in local contexts is also useful for resolving language ambiguities and improving retrieval output. two main text decomposition strategies are introduced in this study, including a chronological decomposition into text segments, and semantic decomposition into text themes. the interaction between text segments and text themes is then used to characterize text structure, and to formulate specifications for information retrieval, text traversal, and text summarization. surrounding text:for a manufacturer, it is possible to combine summaries from multiple merchant sites to produce a single report for each of its products. our task is different from traditional text summarization [15, 39, ***]<2> in a number of ways. first of all, a summary in our case is structured rather than another (but shorter) free text document as produced by most text summarization systems influence:3 type:3 pair index:458 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. 
we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:233 citee title:analysis of syntax-based pronoun resolution methods citee abstract:this paper presents a pronoun resolution algo- rithm that adheres to the constraints and rules of centering theory (grosz et al., 1995) and is an alternative to brennan et al.'s 1987 algorithm. the advantages of this new model, the left-right centering algorithm (lrc), lie in its incremental processing of utterances and in its low computational overhead. the algorithm is compared with three other pronoun resolution methods: hobbs' syntax-based algorithm, strube's s-list approach, and the bfp surrounding text:we believe that they may be used in practical settings. we also note three main limitations of our system: (1) we have not dealt with opinion sentences that need pronoun resolution [***]<3>. for instance, it is quiet but powerful influence:3 type:3 pair index:459 citer id:403 citer title:mining and summarizing customer reviews citer abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques citee id:694 citee title:learning subjective adjectives from corpora citee abstract:subjectivity tagging is distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. 
there are numerous applications for which subjectivity tagging is relevant, including information extraction and information retrieval. this paper identifies strong clues of subjectivity using the results of a method for clustering words according to distributional similarity (lin 1998), seeded by a small amount of detailed manual annotation surrounding text:these are words that are primarily used to express subjective opinions. clearly, this is related to existing work on distinguishing sentences used to express subjective opinions from sentences used to objectively describe some factual information [***]<2>. previous work on subjectivity [44, 4]<2> has established a positive statistically significant correlation with the presence of adjectives influence:1 type:2 pair index:460 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:655 citee title:indexing by latent semantic analysis citee abstract:a new method for automatic indexing and retrieval is described. the approach is to take advantage of implicit higher-order structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents on the basis of terms found in queries. the particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. documents are represented by ca. 100 item vectors of factor weights. queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. initial tests find this completely automatic method for retrieval to be promising. surrounding text:yet, it is well known that literal term matching has severe drawbacks, mainly due to the ambivalence of words and their unavoidable lack of precision as well as due to personal style and individual differences in word usage. latent semantic analysis (lsa) [***]<1> is an approach to automatic indexing and information retrieval that attempts to overcome these problems by mapping documents as well as terms to a representation in the so-called latent semantic space. lsa usually takes the (high dimensional) vector space representation of documents based on term frequencies [14]<1> as a starting point and applies a dimension reducing linear projection. 4 probabilistic latent semantic analysis. 4.1 latent semantic analysis. as mentioned in the introduction, the key idea of lsa [***]<1> is to map documents (and by symmetry terms) to a vector space of reduced dimensionality, the latent semantic space.
this mapping is computed by decomposing the term/document matrix n with svd, n = u σ v^t, where u and v are orthogonal matrices (u^t u = v^t v = i) and the diagonal matrix σ contains the singular values of n. queries or documents which were not part of the original collection can be folded in by a simple matrix multiplication (cf. [***]<1> for details). [figure: precision-recall curves on the med, cran, cacm, and cisi collections, each with tf and tfidf weighting; axes are recall [%] vs. precision [%], comparing the cos (cosine matching) baseline with the latent-space methods] influence:1 type:1 pair index:461 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:692 citee title:latent semantic indexing (lsi): trec-3 report citee abstract:this paper reports on recent developments of the latent semantic indexing (lsi) retrieval method for trec-3. lsi uses a reduced-dimension vector space to represent words and documents. an important aspect of this representation is that the association between terms is automatically captured, explicitly represented, and used to improve retrieval. we used lsi for both trec-3 routing and adhoc tasks. for the routing tasks an lsi space was constructed using the training documents. we compared surrounding text:in many applications this has proven to result in more robust word processing. although lsa has been applied with remarkable success in different domains including automatic indexing (latent semantic indexing, lsi) [1, ***]<2>, it has a number of deficits, mainly due to its unsatisfactory statistical foundation. the primary goal of this paper is to present a novel approach to lsa and factor analysis, called probabilistic latent semantic analysis (plsa), that has a solid statistical foundation, since it is based on the likelihood principle and defines a proper generative model of the data influence:2 type:3 pair index:462 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
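To make the SVD mapping and the fold-in step described above concrete, here is a small NumPy sketch of rank-k LSI with query fold-in; the toy term-document matrix, the choice k = 2, and the scaling convention are assumptions for illustration.

```python
# Toy LSI sketch: rank-k SVD of a term-document matrix and Deerwester-style
# query fold-in. The matrix, k, and the scaling convention are assumptions.
import numpy as np

# rows = terms, columns = documents (toy counts)
N = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 1, 2],
    [1, 0, 2, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(N, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T     # Vk: documents x k

def fold_in(query_term_vector):
    """Map a query (term-frequency vector) into the k-dimensional latent space."""
    return query_term_vector @ Uk @ np.linalg.inv(Sk)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

q = np.zeros(N.shape[0])
q[[0, 4]] = 1.0                            # query containing terms 0 and 4 (assumed)
q_hat = fold_in(q)
scores = [cosine(q_hat, d) for d in Vk]    # rank documents in the latent space
print(np.argsort(scores)[::-1])
```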
in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:826 citee title:topic-based language models using em citee abstract:in this paper, we propose a novel statistical language model to capture topic-related long-range dependencies. topics are modeled in a latent variable framework in which we also derive an em algorithm to perform a topic factor decomposition based on a segmented training corpus. the topic model is combined with a standard language model to be used for on-line word prediction. perplexity results indicate an improvement over previously proposed topic models, which unfortunately has not been translated into lower word error. surrounding text:e.g., for language modeling [***]<2> and collaborative filtering [5]<2>. acknowledgment this work has been supported by a daad postdoctoral fellowship influence:2 type:3 pair index:463 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:681 citee title:latent class models for collaborative filtering citee abstract: this paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior. two models are discussed and compared: the aspect model, a probabilistic latent space model which models individual preferences as a convex combination of preference factors, and the two-sided clustering model, which simultaneously partitions persons and objects into clusters. we present em algorithms for different variants of the aspect model and derive an approximate em algorithm based on a variational principle for the two-sided clustering model. the benefits of the different models are experimentally investigated on a large movie data set surrounding text:e.g., for language modeling [4]<2> and collaborative filtering [***]<2>. acknowledgment this work has been supported by a daad postdoctoral fellowship influence:2 type:3 pair index:464 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:169 citee title:probabilistic latent semantic analysis citee abstract:probabilistic latent semantic analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. compared to standard latent semantic analysis which stems from linear algebra and performs a singular value decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. this results in a more principled approach which has a solid foundation in statistics. in order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered em. our approach yields substantial and consistent improvements over latent semantic analysis in a number of experiments. surrounding text:for example, although a term like "un" would by itself be best explained by the "bosnia" factor, the context of the other query terms drastically increases the probability that this particular occurrence of "un" is related to the events in rwanda. the same mechanism is able to detect "true" polysems [***]<3>. 5 probabilistic latent semantic indexing 5 influence:1 type:1 pair index:465 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi.
in particular, the combination of models with di erent dimensionalities has proven to be advantageous citee id:827 citee title:unsupervised learning from dyadic data citee abstract:dyadic data refers to a domain with two finite sets of objects in which observa tions are made for dyads ie pairs with one element from either set this includes event cooccurrences histogram data and single stimulus preference data as special cases dyadic data arises naturally in many applications ranging from computational linguistics and information retrieval to preference analysis and computer vision in this paper we present a systematic domainindependent framework for unsupervised learning from dyadic data by statistical mixture models our approach covers dier ent models with at and hierarchical latent class structures and unifies probabilistic modeling and structure discovery mixture models provide both a parsimonious yet exible parameterization of probability distributions with good generalization perfor mance on sparse data as well as structural information about datainherent grouping structure we propose an annealed version of the standard expectation maximization algorithm for model fitting which is empirically evaluated on a variety of data sets from dierent domains surrounding text:in addition, the factor representation obtained by plsa allows to deal with polysemous words and to explicitly distinguish between di erent meanings and di erent types of word usage. 2 the aspect model the core of plsa is a statistical model which has been called aspect model [***, 15]<1>. the latter is a latent variable model for general co-occurrence data which associates an unobserved class variable z 2 z = fz1 influence:1 type:1 pair index:466 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain{specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with di erent dimensionalities has proven to be advantageous citee id:828 citee title:tdt pilot study corpus citee abstract:the tdt corpus includes approximately 16,000 stories about half collected from reuters newswire and half from cnn broadcast news transcripts during the period july 1, 1994 to june 30, 1995. an integral and key part of the corpus is the annotation in terms of news events discussed in the stories. twenty-five events were defined that span a variety of event types and that cover a subset of the events discussed in the corpus stories. 
annotation data for these events are included in the corpus and provide a basis for training tdt systems surrounding text:figure 2: folding in a query consisting of the terms "aid", "food", "medical", "people", "un", and "war": evolution of posterior probabilities and the mixing proportions p(z|q) (rightmost column in each bar plot) for the four factors depicted in table 2 after 1 (first row), 2 (second row), 3 (third row), and 20 (fourth row) iterations. experiments with the tdt-1 collection, which contains 15,862 documents of broadcast news stories [***]<3>. 1 stop words have been eliminated by a standard stop word list, no stemming or further preprocessing has been performed influence:3 type:3 pair index:467 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:740 citee title:mixture models citee abstract:mixture models are an interesting and flexible model family. the different uses of mixture models include for example generative component models, clustering and density estimation. moreover, mixture models have been successfully used in various kinds of tasks such as modelling failure rate data, clustering teaching behaviour and in general for modelling large heterogeneous populations. the first step, when using mixture models, is to define a suitable model for the data. next the parameter of the model must be estimated from data. this phase is called parameter estimation or parameter learning. another important task is finding the number of subpopulations that the data supports and this problem is called model learning. in this paper we will discuss both issues and cover the basic methods. we also introduce some modern methods and give numerous examples. surrounding text:where (1) p(w|d) = ∑_{z ∈ Z} p(w|z) p(z|d) (2) essentially, to derive (2) one has to sum over the possible choices of z which could have generated the observation. the aspect model is a statistical mixture model [***]<1> which is based on two independence assumptions: first, observation pairs (d, w) are assumed to be generated independently
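the aspect model equation quoted immediately above, p(w|d) = ∑_{z∈Z} p(w|z) p(z|d), can be fitted by em; the sketch below is an illustrative assumption of this edit, not the authors' implementation, running the e-step p(z|d,w) ∝ p(w|z)p(z|d) and the m-step re-estimation on a toy document/word count matrix.

import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(6, 8)).astype(float)   # toy counts n(d, w)
D, W, K = n_dw.shape[0], n_dw.shape[1], 2

p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # p(w|z)
p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)   # p(z|d)

for _ in range(50):
    # e-step: p(z|d,w) proportional to p(w|z) p(z|d)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]        # shape (D, K, W)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # m-step: re-estimate the multinomials from expected counts n(d,w) p(z|d,w)
    nz = n_dw[:, None, :] * post
    p_w_z = nz.sum(axis=0); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = nz.sum(axis=2); p_z_d /= p_z_d.sum(axis=1, keepdims=True)

print(np.round(p_z_d, 3))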
in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:195 citee title:a view of the em algorithm that justifies incremental and other variants citee abstract:the em algorithm performs maximum likelihood estimation for data in which some variables are unobserved. we present a function that resembles negative free energy and show that the m step maximizes this function with respect to the model parameters and the e step maximizes it with respect to the distribution over the unobserved variables. from this perspective, it is easy to justify an incremental variant of the em algorithm in which the distribution for only one of the unobserved variables surrounding text:p_β(z|d, w) = p(z) [p(d|z) p(w|z)]^β / ∑_{z'} p(z') [p(d|z') p(w|z')]^β (9) notice that β = 1 results in the standard e-step, while for β < 1 the likelihood part in bayes' formula is discounted (additively on the log-scale). it can be shown that tem minimizes an objective function known as the free energy [***]<3> and hence defines a convergent algorithm. while temperature-based generalizations of em and related algorithms for optimization are often used as a homotopy or continuation method to avoid unfavorable local extrema, the main advantage of tem in our context is to avoid overfitting influence:2 type:1 pair index:469 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words.
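the tempered e-step of equation (9) above differs from the standard e-step only in raising the likelihood part to the power β; a minimal sketch (illustrative, with array shapes assumed for this edit) is:

import numpy as np

def tempered_e_step(p_z, p_d_z, p_w_z, beta=0.75):
    # p_beta(z|d,w) proportional to p(z) * [p(d|z) p(w|z)]**beta; beta=1 gives the standard e-step
    num = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta   # (K, D, W)
    return num / (num.sum(axis=0, keepdims=True) + 1e-12)

K, D, W = 2, 3, 4
rng = np.random.default_rng(1)
p_z = np.ones(K) / K
p_d_z = rng.random((K, D)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)   # p(d|z)
p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # p(w|z)
print(tempered_e_step(p_z, p_d_z, p_w_z).shape)   # (2, 3, 4)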
since a principled derivation of tem is beyond the scope of this paper (the interested reader is referred to [***, 7]<3>), we will present the necessary modification of standard em in an ad hoc manner. essentially, one introduces a control parameter β (inverse computational temperature) and modifies the e-step in (5) according to p_β(z|d influence:3 type:3 pair index:470 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:566 citee title:introduction to modern information retrieval citee abstract:new technology now allows the design of sophisticated information retrieval systems that can not only analyze, process and store, but can also retrieve specific resources matching a particular user's needs. this clear and practical text relates the theory, techniques and tools critical to making information retrieval work. a completely revised second edition incorporates the latest developments in this rapidly expanding field, including multimedia information retrieval, user interfaces and digital libraries. chowdhury's coverage is comprehensive, including classification, cataloging, subject indexing, abstracting, vocabulary control; cd-rom and online information retrieval; multimedia, hypertext and hypermedia; expert systems and natural language processing; user interface systems; internet, world wide web and digital library environments. illustrated with many examples and comprehensively referenced for an international audience, this is an ideal textbook for students of library and information studies and those professionals eager to advance their knowledge of the future of information. surrounding text:latent semantic analysis (lsa) [1]<1> is an approach to automatic indexing and information retrieval that attempts to overcome these problems by mapping documents as well as terms to a representation in the so-called latent semantic space. lsa usually takes the (high dimensional) vector space representation of documents based on term frequencies [***]<1> as a starting point and applies a dimension reducing linear projection. the specific form of this mapping is determined by a given document collection and is based on a singular value decomposition (svd) of the corresponding term/document matrix. 5 probabilistic latent semantic indexing 5.1 vector-space models and lsi one of the most popular families of information retrieval techniques is based on the vector-space model (vsm) for documents [***]<1>.
a vsm variant is characterized by three ingredients: (i) a transformation function (also called local term weight), (ii) a term weighting scheme (also called global term weight), and (iii) a similarity measure influence:3 type:1 pair index:471 citer id:427 citer title:probabilistic latent semantic indexing citer abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous citee id:203 citee title:aggregate and mixed-order markov models for statistical language processing citee abstract:we consider the use of language models whose size and accuracy are intermediate between different order n-gram models. two types of models are studied in particular. aggregate markov models are class-based bigram models in which the mapping from words to classes is probabilistic. mixed-order markov models combine bigram models whose predictions are conditioned on different words. both types of models are trained by expectation-maximization (em) algorithms for maximum likelihood estimation. we examine smoothing procedures in which these models are interposed between different order n-grams. this is found to significantly reduce the perplexity of unseen word combinations. surrounding text:in addition, the factor representation obtained by plsa allows to deal with polysemous words and to explicitly distinguish between different meanings and different types of word usage. 2 the aspect model the core of plsa is a statistical model which has been called aspect model [7, ***]<1>. the latter is a latent variable model for general co-occurrence data which associates an unobserved class variable z ∈ Z = {z1
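the three vsm ingredients listed above (a local term weight, a global term weight, and a similarity measure) can be made concrete with the familiar tf/idf/cosine choices; the sketch below is illustrative only and does not reproduce the weighting scheme of any cited paper.

import math
from collections import Counter

docs = [["food", "aid", "un"], ["un", "war", "people"], ["medical", "aid", "people"]]
vocab = sorted({w for d in docs for w in d})

def tf(doc):                          # (i) local term weight: raw counts
    c = Counter(doc)
    return [c[w] for w in vocab]

def idf():                            # (ii) global term weight
    n = len(docs)
    return [math.log(n / sum(w in d for d in docs)) for w in vocab]

def cosine(a, b):                     # (iii) similarity measure
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)); nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

g = idf()
vecs = [[t * w for t, w in zip(tf(d), g)] for d in docs]
query = [t * w for t, w in zip(tf(["un", "aid"]), g)]
print([round(cosine(query, v), 3) for v in vecs])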
in particular, the combination of models with di erent dimensionalities has proven to be advantageous citee id:818 citee title:predicting the performance of linearly combined ir systems citee abstract:we introduce a new technique for analyzingcombination modelsfi the technique allows us to makequalitative conclusions about which ir systems shouldbe combinedfi we achieve this by using a linear regressionto accurately (r2= 0:98) predict the performance of thecombined system based on quantitative measurementsof individual component systems taken from trec5fiwhen applied to a linear model (weighted sum of relevancescores), the technique supports several previouslysuggested hypotheses: surrounding text:), as suggested in [3]<1> (cf. [***]<1> for a more detailed empirical investigation of linear combination schemes for information retrieval systems). 5 influence:3 type:3 pair index:473 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:421 citee title:matching words and pictures citee abstract:: we present a new an very rich approach for moeling multi-moal ata sets,focusing on the specific case of segmente images with associate textfi learning thejoint istribution of image regions an wors has many applicationsfi we consier inetail preicting wors associate with whole images (auto-annotation) ancorresponing to particular image regions (region naming)fi auto-annotation mighthelp organize an access large collections of imagesfi region naming is a moel ofobject surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the objects location and segmentation, as in face detection [17, 24]<2>, to providing only auxiliary data indicating the objects identity [***, 5, 7, 25]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen biases influence:1 type:2 pair index:474 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. 
we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:422 citee title:minimum complexity density estimation citee abstract:the authors introduce an index of resolvability that is proved to bound the rate of convergence of minimum complexity density estimators as well as the information-theoretic redundancy of the corresponding total description length. the results on the index of resolvability demonstrate the statistical effectiveness of the minimum description-length principle as a method of inference. the minimum complexity estimator converges to true density nearly as fast as an estimator based on prior knowledge of the true subclass of densities. interpretations and basic properties of minimum complexity estimators are discussed. some regression and classification problems that can be examined from the minimum description-length framework are considered surrounding text:even though the new categories all had substantially fewer images (around 200), the results are still encouraging. discussion: in the experiments it was necessary to specify the number of topics k, however bayesian [21]<3> or minimum complexity methods [***]<3> can be used to infer the number of topics implied by a corpus. while designing these experiments, we grew to appreciate the many difficulties in searching for good datasets influence:3 type:3 pair index:475 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. 
the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hofmann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al$ [***]<1>. both use the bag of words model, where positional relationships between features are ignored influence:1 type:1 pair index:476 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:424 citee title:object recognition as machine translation: learning a lexicon for a fixed image vocabulary citee abstract:we describe a model of object recognition as machine translationfi surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the objects location and segmentation, as in face detection [17, 24]<2>, to providing only auxiliary data indicating the objects identity [1, ***, 7, 25]<2>. 
for a large dataset, any annotation is expensive, or may introduce unforeseen biases influence:2 type:2 pair index:477 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:425 citee title:learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories citee abstract:current computational approaches to learning visual object categories require thousands of training images, are slow, cannot learn in an incremental manner and cannot incorporate prior information into the learning process. in addition, no algorithm presented in the literature has been tested on more than a handful of object categories. we present an method for learning object categories from just a few training images. it is quick and it uses prior information in a principled way. we test it on a dataset composed of images of objects belonging to 101 widely varied categories. our proposed method is based on making use of prior information, assembled from (unrelated) object categories which were previously learnt. a generative probabilistic model is used, which represents the shape and appearance of a constellation of features belonging to the object. the parameters of the model are learnt incrementally in a bayesian manner. our incremental algorithm is compared experimentally to an earlier batch bayesian algorithm, as well as to one based on maximum-likelihood. the incremental and batch versions have comparable classification performance on small training sets, but incremental learning is significantly faster, making real-time learning feasible. both bayesian methods outperform maximum likelihood on small training sets. surrounding text:data sets: our data set consists of six categories from the caltech image datasets (as previously used by fergus et al. [7]<3> for semi-supervised classification), and two categories ((7) and (8) below) from the more difficult 101 category dataset [***]<3>. 
label | description | # images
(1) | all faces | 435
(1ub) | faces on uniform background (a cropped version of (1)) | 435
(2) | all motorbikes | 800
(2ub) | motorbikes on uniform background (a subset of (2)) | 349
(3) | all airplanes | 800
(3ub) | airplanes on uniform background (a subset of (3)) | 263
(4) | cars rear | 1155
(5) | leopards | 200
(6) | background | 1370
(7) | watch | 241
(8) | ketch | 114
the reason for picking these particular categories is pragmatic: they are the ones with the greatest number of images per category influence:1 type:2 pair index:478 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift-like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:426 citee title:finding scientific topics citee abstract:a first step in identifying the content of a document is determining which topics that document addresses. we describe a generative model for documents, introduced by blei, ng, and jordan, in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. we then present a markov chain monte carlo algorithm for inference in this model. we use this algorithm to analyze abstracts from pnas by using bayesian model selection to establish the number of topics. we show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying hot topics by examining temporal dynamics and tagging abstracts to illustrate semantic content. surrounding text:the goal is to maximize the following likelihood: p(w | α, β) = ∫∫ ∑_z p(w|z, φ) p(z|θ) p(θ|α) p(φ|β) dθ dφ (3) where θ and φ are multinomial parameters over the topics and words respectively and p(θ|α) and p(φ|β) are dirichlet distributions parameterized by the hyperparameters α and β. since the integral is intractable to solve directly, we solve for the parameters using gibbs sampling, as described in [***]<1>. the hyperparameters control the mixing of the multinomial weights (lower values give less mixing) and can prevent degeneracy.
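the excerpt above states that the lda parameters are obtained by gibbs sampling with scalar dirichlet hyperparameters; the sketch below is a compact collapsed gibbs sampler following the standard update p(z_i = k | ·) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ), offered as an illustration rather than the cited implementation.

import numpy as np

def lda_gibbs(docs, K, V, alpha=0.5, beta=0.5, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    ndk = np.zeros((len(docs), K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
    z = []
    for d, doc in enumerate(docs):                 # random initial topic assignments
        zd = rng.integers(0, K, size=len(doc)); z.append(zd)
        for w, t in zip(doc, zd):
            ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d, t] -= 1; nkw[t, w] -= 1; nk[t] -= 1       # remove token i
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                t = rng.choice(K, p=p / p.sum())                  # resample its topic
                z[d][i] = t
                ndk[d, t] += 1; nkw[t, w] += 1; nk[t] += 1
    phi = (nkw + beta) / (nkw.sum(axis=1, keepdims=True) + V * beta)       # p(w|z)
    theta = (ndk + alpha) / (ndk.sum(axis=1, keepdims=True) + K * alpha)   # p(z|d)
    return phi, theta

docs = [[0, 0, 1, 2], [2, 3, 3, 4], [0, 1, 1, 4]]   # word ids, vocabulary size 5
phi, theta = lda_gibbs(docs, K=2, V=5)
print(np.round(theta, 2))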
as in [***]<1>, we specialize to scalar hyperparameters (e. g influence:1 type:1 pair index:479 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift-like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:427 citee title:probabilistic latent semantic indexing citee abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous surrounding text:documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hofmann [***, 10]<1>, and the latent dirichlet allocation (lda) of blei et al. [3]<1>. both use the bag of words model, where positional relationships between features are ignored. in the case of plsa, the topic specific distributions p(w|z) are learned from a separate set of training images. when observing a new unseen test image, the document specific mixing coefficients p(z|d_test) are computed using the fold-in heuristic described in [***]<1>. in particular, the unseen image is projected on the simplex spanned by learned p(w|z), i
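the fold-in heuristic described in the excerpt above, projecting an unseen image onto the simplex spanned by the learned p(w|z), amounts to a partial em in which p(w|z) is held fixed and only the mixing coefficients p(z|d_test) are updated; the sketch below is an illustrative assumption of this edit, not the cited implementation.

import numpy as np

def fold_in(counts_new, p_w_z, iters=50, seed=0):
    # estimate p(z|d_test) for an unseen document/image, keeping p(w|z) (shape K x W) fixed
    K = p_w_z.shape[0]
    rng = np.random.default_rng(seed)
    p_z_d = rng.random(K); p_z_d /= p_z_d.sum()
    for _ in range(iters):
        post = p_z_d[:, None] * p_w_z                     # proportional to p(z|d) p(w|z)
        post /= post.sum(axis=0, keepdims=True) + 1e-12
        p_z_d = (post * counts_new[None, :]).sum(axis=1)  # expected topic counts
        p_z_d /= p_z_d.sum() + 1e-12
    return p_z_d

K, W = 3, 10
rng = np.random.default_rng(1)
p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)   # assumed learned model
counts_new = rng.integers(0, 4, size=W).astype(float)                   # new word histogram
print(np.round(fold_in(counts_new, p_w_z), 3))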
we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:428 citee title:unsupervised learning by probabilistic latent semantic analysis citee abstract:this paper presents a novel statistical method for factor analysis of binary and count data whichis closely related to a technique known as latent semantic analysis. in contrast to the latter method whichstems from linear algebra and performs a singular value decomposition of co-occurrence tables, the proposedtechnique uses a generative latent class model to perform a probabilistic mixture decomposition. this resultsin a more principled approach with a solid foundation in statistical inference. more precisely, we propose tomake use of a temperature controlled version of the expectation maximization algorithm for model fitting, whichhas shown excellent performance in practice. probabilistic latent semantic analysis has many applications, mostprominently in information retrieval, natural language processing, machine learning from text, and in related areas.the paper presents perplexity results for different types of text and linguistic data collections and discusses anapplication in automated document indexing. the experiments indicate substantial and consistent improvementsof the probabilistic method over standard latent semantic analysis. surrounding text:documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hofmann [9, ***]<1>, and the latent dirichlet allocation (lda) of blei et al$ [3]<1>. both use the bag of words model, where positional relationships between features are ignored. p(wd) and the fitted model. the model is fitted using the expectation maximization (em) algorithm as described in [***]<1>. lda: in contrast to plsa, lda treats the multinomial weights over topics as latent random variables influence:1 type:1 pair index:481 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). 
in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:429 citee title:robust wide baseline stereo from maximally stable extremal regions citee abstract: the wide - baseline stereo problem, i e the problem of establishing correspon dences between a pair of images taken from different viewpoints is studied a new set of image elements that are put into correspondence, the so called extremal regions , is introduced sirable properties: the set is closed under 1 transformation of image coordinates and 2 age intensities an efficient (near linear complexity) and practically fast de tection algorithm (near frame rate) is presented for an affinely subset of extremal regions, the maximally stable extremal regions (mser) a new robust similarity measure for establishing tentative correspon dences is proposed the robustness ensures that invariants from multiple measurement regions (regions obtained by invariant constructions from ex tremal regions), some that are significantly larger (and hence discriminative) than the msers, may be used to establish tentative correspondences the high utility of msers, multiple measurement regions and the robust metric is demonstrated in wide - baseline experiments on image pairs from both indoor and outdoor scenes significant change of scale (3 nation conditions, out - of - plane rotation, occlusion , locally anisotropic scale change and 3d translation of the viewpoint are all present in the test prob lems good estimates of epipolar geometry (average distance from corre sponding points to the epipolar line below 0 are obtained surrounding text:to use plsa/lda generative statistical models, we seek a vocabulary of visual words which will be insensitive to changes in viewpoint and illumination. we use vector quantized sift descriptors [12]<1> computed on affine covariant regions [***, 14, 16]<1>. affine covariance gives us tolerance to viewpoint changes. the second is constructed using the maximally stable procedure of matas et al. [***]<1> where areas are selected from an intensity watershed image segmentation. for both of these we use the binaries provided at [23]<1> influence:3 type:1 pair index:482 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). 
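a visual-word vocabulary of the kind described in the surrounding text above is usually built by clustering local descriptors (for example 128-dimensional sift vectors computed on the detected regions) and assigning each region to its nearest cluster centre; the sketch below uses random data and a bare-bones numpy k-means to show the quantization and bag-of-visual-words histogram step (illustrative only, not the detectors or binaries used in the cited work).

import numpy as np

rng = np.random.default_rng(0)
train_desc = rng.random((1000, 128))     # descriptors pooled from training images
V = 50                                   # vocabulary size (number of visual words)

# a few iterations of plain k-means to build the vocabulary
centres = train_desc[rng.choice(len(train_desc), V, replace=False)]
for _ in range(10):
    d2 = ((train_desc[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    for v in range(V):
        pts = train_desc[assign == v]
        if len(pts):
            centres[v] = pts.mean(axis=0)

def bag_of_visual_words(image_desc):
    # quantize one image's descriptors and return its visual-word histogram
    d2 = ((image_desc[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.bincount(d2.argmin(axis=1), minlength=V)

print(bag_of_visual_words(rng.random((200, 128))))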
in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:206 citee title:an affine invariant interest point detector citee abstract:this paper presents a novel approach for detecting affine invariant interest pointsfi our method can deal with significant affine transformations including large scale changesfi such transformations introduce significant changes in the point location as well as in the scale and the shape of the neighbourhood of an interest pointfi our approach allows to solve for these problems simultaneouslyfi it is based on three key ideas: 1) the second moment matrix computed in a point can be used to normalize surrounding text:to use plsa/lda generative statistical models, we seek a vocabulary of visual words which will be insensitive to changes in viewpoint and illumination. we use vector quantized sift descriptors [12]<1> computed on affine covariant regions [13, ***, 16]<1>. affine covariance gives us tolerance to viewpoint changes. the first is constructed by elliptical shape adaptation about an interest point. the method is described in [***, 16]<1>. the second is constructed using the maximally stable procedure of matas et al influence:3 type:1 pair index:483 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. 
performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:430 citee title:weak hypotheses and boosting for generic object detection and recognition citee abstract:fi in this paper we describe the first stage of a new learning system for object detection and recognitionfi for our system we propose boosting as the underlying learning techniquefi this allows the use of very diverse sets of visual features in the learning process within a com-mon framework: boosting, together with a weak hypotheses finder, may choose very inhomogeneous features as most relevant for combina-tion into a final hypothesisfi as another advantage the weak hypotheses finder may search the weak hypotheses space without explicit calculation of all available hypotheses, reducing computation timefi this contrasts the related work of agarwal and roth where winnow was used as learning algorithm and all weak hypotheses were calculated explicitlyfi in our first empirical evaluation we use four types of local descriptors: two basic ones consisting of a set of grayvalues and intensity moments and two high level descriptors: moment invariants and sifts fi the descriptors are calculated from local patches detected by an inter-est point operatorfi the weak hypotheses finder selects one of the local patches and one type of local descriptor and e ciently searches for the most discriminative similarity thresholdfi this di ers from other work on boosting for object recognition where simple rectangular hypotheses or complex classifiers have been usedfi in relatively simple images, where the objects are prominent, our approach yields results comparable to the state-of-the-art fi but we also obtain very good results on more complex images, where the objects are located in arbitrary positions, poses, and scales in the imagesfi these results indicate that our flexible approach, which also allows the inclusion of features from segmented re-gions and even spatial relationships, leads us a significant step towards generic object recognitionfi surrounding text:sift descriptors, based on histograms of local orientation, give some tolerance to illumination change. others have used similar descriptors for object classification [4, ***]<2>, but in a supervised setting. we compare the two statistical models with a control global texture model, similar to those proposed for preattentive vision [22]<2> and image retrieval [19]<2> influence:2 type:2 pair index:484 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. 
the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:181 citee title:a statistical method for 3d object detection applied to faces and cars citee abstract:in this paper, we describe a statistical method for 3d object detectionfi we represent the statistics of both object appear-ance and "non-object" appearance using a product of histo-gramsfi each histogram represents the joint statistics of a subset of wavelet coefficients and their position on the objectfi our approach is to use many such histograms representing a wide variety of visual attributesfi using this method, we have devel-oped the first algorithm that can reliably detect human faces with out-of-plane rotation and the first algorithm that can reli-ably detect passenger cars over a wide range of viewpointsfi surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the objects location and segmentation, as in face detection [***, 24]<2>, to providing only auxiliary data indicating the objects identity [1, 5, 7, 25]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen biases influence:2 type:2 pair index:485 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:431 citee title:video google: a text retrieval approach to object matching in videos citee abstract:we describe an approach to object and scene retrieval which searches for and localizes all the occurrences of a user outlined object in a video. the object is represented by a set of viewpoint invariant region descriptors so that recognition can proceed successfully despite changes in viewpoint, illumination and partial occlusion. the temporal continuity of the video within a shot is used to track the regions in order to reject unstable regions and reduce the effects of noise in the descriptors. 
the analogy with text retrieval is in the implementation where matches on descriptors are pre-computed (using vector quantization), and inverted file systems and document rankings are used. the result is that retrieval is immediate, returning a ranked list of key frames/shots in the manner of google. the method is illustrated for matching on two full length feature films. surrounding text:we apply models used in statistical natural language processing to discover object categories and their image layout analogously to topic discovery in text. documents are images and we quantize local appearance descriptions to form visual words [4, ***, 20, 26]<2>. the two models we investigate are the probabilistic latent semantic analysis (plsa) of hofmann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al$ [3]<1> influence:2 type:2 pair index:486 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:432 citee title:improved video content indexing by multiple latent semantic analysis citee abstract:low-level features are now becoming insufficient to build efficient content-based retrieval systems. users are not interested any longer in retrieving visually similar content, but they expect retrieval systems to also find documents with similar semantic content. bridging the gap between low-level features and semantic content is a challenging task necessary for future retrieval systems. latent semantic analysis (lsa) was successfully introduced to efficiently index text documents by detecting synonyms and the polysemy of words. we have successfully proposed an adaptation of lsa to model video content for object retrieval and semantic content estimation. following this idea we now present a new model composed of multiple lsas (m-lsa) to better represent the video content. in the experimental section, we make a comparison of lsa and m-lsa on two problems, namely object retrieval and semantic content estimation. surrounding text:others have used similar descriptors for object classification [4, 15]<2>, but in a supervised setting. 
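The passages above describe forming a visual vocabulary by vector quantizing SIFT-like region descriptors and then treating each image as a bag of the resulting "visual words". The short Python sketch below illustrates only that quantization step under assumed inputs (random stand-in descriptors, an arbitrary vocabulary size, and scikit-learn's k-means); it is an illustration of the general technique, not the pipeline of the cited systems.

# Minimal sketch: vector-quantizing local descriptors into "visual words".
# Assumes each image is already represented by a (num_regions x 128) array of
# SIFT-like descriptors; the descriptor extraction itself is not shown.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptor_sets, vocab_size=500, seed=0):
    """Cluster all descriptors with k-means; cluster centres act as visual words."""
    stacked = np.vstack(descriptor_sets)
    return KMeans(n_clusters=vocab_size, n_init=4, random_state=seed).fit(stacked)

def bag_of_visual_words(descriptors, codebook):
    """Quantize one image's descriptors and count word occurrences (the 'document')."""
    words = codebook.predict(descriptors)
    return np.bincount(words, minlength=codebook.n_clusters)

# usage sketch with random stand-in descriptors
rng = np.random.default_rng(0)
images = [rng.normal(size=(rng.integers(50, 200), 128)) for _ in range(20)]
codebook = build_codebook(images, vocab_size=100)
doc_term = np.array([bag_of_visual_words(d, codebook) for d in images])
print(doc_term.shape)  # (num_images, vocab_size)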
we compare the two statistical models with a control global texture model, similar to those proposed for preattentive vision [22]<2> and image retrieval [***]<2>. sect influence:3 type:2 pair index:487 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:437 citee title:hierarchical dirichlet processes citee abstract:we consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. we assume that the number of mixture components is unknown a priori and is to be inferred from the data. in this setting it is natural to consider sets of dirichlet processes, one for each group, where the well-known clustering property of the dirichlet process provides a nonparametric prior for the number of mixture components within each group. given our desire to tie the mixture models in the various groups, we consider a hierarchical model, speci.cally one in which the base measure for the child dirichlet processes is itself distributed according to a dirichlet process. such a base measure being discrete, the child dirichlet processes necessar-ily share atoms. thus, as desired, the mixture models in the different groups necessarily share mixture components. we discuss representations of hierarchical dirichlet processes in terms of a stick-breaking process, and a generalization of the chinese restaurant process that we refer to as the chinese restaurant franchise. we present markov chain monte carlo algorithms for posterior inference in hierarchical dirichlet process mixtures, and describe applications to problems in information retrieval and text modelling surrounding text:even though the new categories all had substantially fewer images (around 200), the results are still encouraging. discussion: in the experiments it was necessary to specify the number of topics k, however bayesian [***]<3> or minimum complexity methods [2]<3> can be used to infer the number of topics implied by a corpus. 
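The discussion above notes that Bayesian nonparametric methods such as hierarchical Dirichlet processes can infer the number of topics rather than requiring it to be fixed in advance. The simulation below illustrates the stick-breaking construction mentioned in the abstract: component weights are produced by repeatedly breaking a unit-length stick, so the number of components actually used adapts to the amount of data. It is a prior simulation only, not the inference algorithm of the cited paper; the concentration parameter and truncation level are arbitrary choices.

# Illustrative sketch of the stick-breaking construction behind Dirichlet-process
# priors: the effective number of components is learned rather than fixed.
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    betas = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

rng = np.random.default_rng(1)
weights = stick_breaking_weights(alpha=2.0, truncation=1000, rng=rng)
for n in (10, 100, 1000, 10000):
    draws = rng.choice(len(weights), size=n, p=weights / weights.sum())
    print(n, "observations use", len(np.unique(draws)), "distinct components")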
while designing these experiments, we grew to appreciate the many difficulties in searching for good datasets influence:3 type:3 pair index:488 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:371 citee title:contextual priming for object detection citee abstract:there is general consensus that context can be a rich source of information about an object"s identity,location and scalefi in fact, the structure of many real-world scenes is governed by strong configurationalrules akin to those that apply to a single objectfi here we introduce a simple framework for modelingthe relationship between context and object properties based on the correlation between the statisticsof low-level features across the entire scene and the objects that it containsfi the surrounding text:others have used similar descriptors for object classification [4, 15]<2>, but in a supervised setting. we compare the two statistical models with a control global texture model, similar to those proposed for preattentive vision [***]<2> and image retrieval [19]<2>. sect influence:2 type:2 pair index:489 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. 
the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:433 citee title:rapid object detection using a boosted cascade of simple features citee abstract:this paper describes a machine learning approach for visual object detection which is capable of processing images extremely rapidly and achieving high detection ratesfi this work is distinguished by three key contributionsfi the first is the introduction of a new image representation called the "integral image" which allows the features used by our detector to be computed very quicklyfi the second is a learning algorithm, based on adaboost, which selects a small number of critical visual features surrounding text:introduction common approaches to object recognition involve some form of supervision. this may range from specifying the objects location and segmentation, as in face detection [17, ***]<2>, to providing only auxiliary data indicating the objects identity [1, 5, 7, 25]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen biases influence:2 type:2 pair index:490 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:434 citee title:unsupervised learning of models for recognition citee abstract:fi we present a method to learn object class models from unlabeled andunsegmented cluttered scenes for the purpose of visual object recognitionfi wefocus on a particular type of model where objects are represented as flexible constellationsof rigid parts (features)fi the variability within a class is representedby a joint probability density function (pdf) on the shape of the constellation andthe output of part detectorsfi in a first stage, the method automatically identifiesdistinctive surrounding text:introduction common approaches to object recognition involve some form of supervision. 
this may range from specifying the objects location and segmentation, as in face detection [17, 24]<2>, to providing only auxiliary data indicating the objects identity [1, 5, 7, ***]<2>. for a large dataset, any annotation is expensive, or may introduce unforeseen biases influence:2 type:2 pair index:491 citer id:436 citer title:discovering objects and their location in images citer abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of citee id:435 citee title:hidden semantic concept discovery in region based image retrieval citee abstract:this work addresses content based image retrieval (cbir), focusing on developing a hidden semantic concept discovery methodology to address effective semantics-intensive image retrieval. in our approach, each image in the database is segmented to region; associated with homogenous color, texture, and shape features. by exploiting regional statistical information in each image and employing a vector quantization method, a uniform and sparse region-based representation is achieved. with this representation a probabilistic model based on statistical-hidden-class assumptions of the image database is obtained, to which expectation-maximization (em) technique is applied to analyze semantic concepts hidden in the database. an elaborated retrieval algorithm is designed to support the probabilistic model. the semantic similarity is measured through integrating the posterior probabilities of the transformed query image, as well as a constructed negative example, to the discovered semantic concepts. the proposed approach has a solid statistical foundation and the experimental evaluations on a database of 10,000 general-purposed images demonstrate its promise of the effectiveness. surrounding text:we apply models used in statistical natural language processing to discover object categories and their image layout analogously to topic discovery in text. documents are images and we quantize local appearance descriptions to form visual words [4, 18, 20, ***]<2>. 
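Once each image is reduced to a histogram of visual words, the topic-discovery step described in the citer abstract can be prototyped with any off-the-shelf LDA implementation, so that discovered topics play the role of object categories. The sketch below uses scikit-learn's variational LDA on synthetic counts purely as a stand-in; the cited work fits pLSA and LDA with its own inference procedures, and the matrix shapes and hyperparameters here are assumptions.

# Sketch: treating each image's visual-word histogram as a document and fitting LDA.
# scikit-learn's variational LDA is a stand-in, not the cited implementation.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
doc_term = rng.poisson(1.0, size=(200, 100))   # stand-in (images x visual words) counts

lda = LatentDirichletAllocation(n_components=5, max_iter=20, random_state=0)
doc_topics = lda.fit_transform(doc_term)        # per-image topic proportions
topic_words = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

print(doc_topics.shape, topic_words.shape)      # (200, 5) (5, 100)
print(doc_topics.argmax(axis=1)[:10])           # an image's dominant topic ~ its category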
the two models we investigate are the probabilistic latent semantic analysis (plsa) of hofmann [9, 10]<1>, and the latent dirichlet allocation (lda) of blei et al. [3]<1> influence:2 type:2 pair index:492 citer id:437 citer title:hierarchical dirichlet processes citer abstract:we consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. we assume that the number of mixture components is unknown a priori and is to be inferred from the data. in this setting it is natural to consider sets of dirichlet processes, one for each group, where the well-known clustering property of the dirichlet process provides a nonparametric prior for the number of mixture components within each group. given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child dirichlet processes is itself distributed according to a dirichlet process.
such a base measure being discrete, the child dirichlet processes necessar-ily share atoms. thus, as desired, the mixture models in the different groups necessarily share mixture components. we discuss representations of hierarchical dirichlet processes in terms of a stick-breaking process, and a generalization of the chinese restaurant process that we refer to as the chinese restaurant franchise. we present markov chain monte carlo algorithms for posterior inference in hierarchical dirichlet process mixtures, and describe applications to problems in information retrieval and text modelling citee id:596 citee title:hidden markov model induction by bayesian model merging citee abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most speci.c model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability. we compare our algorithm with the baum-welch method of estimating .xed-size models, and .nd that it can induce minimal hmms from data in cases where .xed estimation does not converge or requires redundant parameters to converge surrounding text:the hdp-hmm provides an alternative to methods that place an explicit para-metric prior on the number of states or make use of model selection methods to select a . xed number of states (stolcke and omohundro 1993)[***]<3>. in work that served as an inspiration for the hdp-hmm, beal et al influence:2 type:3 pair index:494 citer id:448 citer title:dynamic social network analysis using latent space models citer abstract:this paper explores two aspects of social network modeling. first, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. the generalized model associates each entity with a point in p-dimensional euclidean latent space. the points can move as time progresses but large moves in latent space are improbable. observed links between entities are more likely if the entities are close in latent space. we show how to make such a model tractable (sub-quadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is o(log n). we use both synthetic and real-world data on up to 11,000 entities which indicate near-linear scaling in computation time and improved performance over four alternative approaches. we also illustrate the system operating on twelve years of nips co-authorship data citee id:389 citee title:crimelink explorer: using domain knowledge to facilitate automated crime association analysis citee abstract:link (association) analysis has been used in law enforcement and intelligence domains to extract and search associations between people from large datasets. nonetheless, link analysis still faces many challenging problems, such as information overload, high search complexity, and heavy reliance on domain knowledge. 
to address these challenges and enable crime investigators to conduct automated, effective, and efficient link analysis, we proposed three techniques which include: the concept space approach, a shortest-path algorithm, and a heuristic approach that captures domain knowledge for determining importance of associations. we implemented a system called crimelink explorer based on the proposed techniques. results from our user study involving ten crime investigators from the tucson police department showed that our system could help subjects conduct link analysis more efficiently. additionally, subjects concluded that association paths found based on the heuristic approach were more accurate than those found based on the concept space approach. surrounding text:1. introduction social network analysis is becoming increasingly important in many fields besides sociology, including intelligence analysis [***]<3>, marketing [2]<3> and recommender systems [3]<3>. here we consider learning in systems in which relationships drift over time influence:2 type:3 pair index:495 citer id:448 citer title:dynamic social network analysis using latent space models citer abstract:this paper explores two aspects of social network modeling. first, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. the generalized model associates each entity with a point in p-dimensional euclidean latent space. the points can move as time progresses but large moves in latent space are improbable. observed links between entities are more likely if the entities are close in latent space. we show how to make such a model tractable (sub-quadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is o(log n). we use both synthetic and real-world data on up to 11,000 entities which indicate near-linear scaling in computation time and improved performance over four alternative approaches. we also illustrate the system operating on twelve years of nips co-authorship data citee id:301 citee title:clustering of bipartite advertiser-keyword graph citee abstract:: in this paper we present top-down and bottom-up hierarchicalclustering methods for large bipartite graphsfithe top down approach employs a flow-based graph partitioningmethod, while the bottom up approach is a multiroundhybrid of the single-link and average-link agglomerativeclustering methodsfi we evaluate the quality of clustersobtained by these two methods using additional textual informationand compare the results against other clusteringtechniquesfi surrounding text:1. introduction social network analysis is becoming increasingly important in many fields besides sociology, including intelligence analysis [1]<3>, marketing [***]<3> and recommender systems [3]<3>. here we consider learning in systems in which relationships drift over time influence:2 type:3 pair index:496 citer id:448 citer title:dynamic social network analysis using latent space models citer abstract:this paper explores two aspects of social network modeling. 
first, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. the generalized model associates each entity with a point in p-dimensional euclidean latent space. the points can move as time progresses but large moves in latent space are improbable. observed links between entities are more likely if the entities are close in latent space. we show how to make such a model tractable (sub-quadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is o(log n). we use both synthetic and real-world data on up to 11,000 entities which indicate near-linear scaling in computation time and improved performance over four alternative approaches. we also illustrate the system operating on twelve years of nips co-authorship data citee id:326 citee title:collaboration analysis in recommender systems using social networks citee abstract:: many researchers have focussed their efforts in developingcollaborative recommender systemsfi it has been provedthat the use of collaboration in such systems improves itsperformance, but what is not known is how this collaborationis done and what is more important, how it has tobe done in order to optimise the information exchangefi thecollaborative relationships in recommender systems can berepresented as a social networkfi in this paper we proposeseveral measures to analyse collaboration surrounding text:1. introduction social network analysis is becoming increasingly important in many fields besides sociology, including intelligence analysis [1]<3>, marketing [2]<3> and recommender systems [***]<3>. here we consider learning in systems in which relationships drift over time influence:2 type:3 pair index:497 citer id:448 citer title:dynamic social network analysis using latent space models citer abstract:this paper explores two aspects of social network modeling. first, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. the generalized model associates each entity with a point in p-dimensional euclidean latent space. the points can move as time progresses but large moves in latent space are improbable. observed links between entities are more likely if the entities are close in latent space. we show how to make such a model tractable (sub-quadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is o(log n). 
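The citer abstract repeated above specifies the modelling ingredients of the dynamic latent space approach — entities as points in a p-dimensional latent space, links more probable between nearby points, and a penalty on large moves between time steps — without giving the exact functional forms in this excerpt. The sketch below writes down one plausible objective of that shape (a logistic link probability in latent distance plus a Gaussian drift penalty); the particular link function and the parameters tau and sigma are illustrative assumptions, not necessarily the paper's choices.

# Minimal sketch of a dynamic latent-space objective: link probability decays with
# latent distance, and positions are discouraged from moving far between time steps.
import numpy as np

def log_likelihood(X_prev, X_curr, links, tau=1.0, sigma=0.5):
    """X_prev, X_curr: (n, p) latent positions at t-1 and t; links: (n, n) 0/1 adjacency at t."""
    diff = X_curr[:, None, :] - X_curr[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    p_link = 1.0 / (1.0 + np.exp(dist - tau))          # closer pairs -> higher probability
    eps = 1e-9
    ll = links * np.log(p_link + eps) + (1 - links) * np.log(1 - p_link + eps)
    np.fill_diagonal(ll, 0.0)                           # no self-links
    drift_penalty = ((X_curr - X_prev) ** 2).sum() / (2 * sigma ** 2)
    return ll.sum() / 2.0 - drift_penalty               # each undirected pair counted once

rng = np.random.default_rng(0)
n, p = 30, 2
X0 = rng.normal(size=(n, p))
X1 = X0 + 0.1 * rng.normal(size=(n, p))
A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
A = A + A.T
print(log_likelihood(X0, X1, A))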
we use both synthetic and real-world data on up to 11,000 entities which indicate near-linear scaling in computation time and improved performance over four alternative approaches. we also illustrate the system operating on twelve years of nips co-authorship data citee id:449 citee title:modern multidimensional scaling citee abstract:the book provides a comprehensive treatment of multidimensional scaling (mds), a family of statistical techniques for analyzing the structure of (dis)similarity data. such data are widespread, including, for example, intercorrelations of survey items, direct ratings on the similarity on choice objects, or trade indices for a set of countries. mds represents the data as distances among points in a geometric space of low dimensionality. this map can help to see patterns in the data that are not obvious from the data matrices. mds is also used as a psychological model for judgments of similarity and preference. this book may be used as an introduction to mds for students in the psychology, sociology, and marketing, in particular. the prerequisite is some elementary background in statistics. the book is also well suited for a variety of advanced courses on mds topics. all the mathematics required for more advanced topics is developed systematically. surrounding text:1 n otherwise however we do not know the true distance matrix d or the resulting similarity matrix ~d. therefore classical mds works with the dissimilarity matrix d obtained from the data [***]<1>. ~d is the similarity matrix obtained from d using ~d = h(. . xxt jf where j  jf denotes the frobenius norm [***]<1>. two questions remain influence:3 type:3 pair index:498 citer id:448 citer title:dynamic social network analysis using latent space models citer abstract:this paper explores two aspects of social network modeling. first, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. the generalized model associates each entity with a point in p-dimensional euclidean latent space. the points can move as time progresses but large moves in latent space are improbable. observed links between entities are more likely if the entities are close in latent space. we show how to make such a model tractable (sub-quadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is o(log n). we use both synthetic and real-world data on up to 11,000 entities which indicate near-linear scaling in computation time and improved performance over four alternative approaches. we also illustrate the system operating on twelve years of nips co-authorship data citee id:451 citee title:optimization with em and expectation-conjugategradient citee abstract:we show a close relationship between the expectation - maximization (em) algorithm and direct optimization algorithms such as gradientbased methods for parameter learning. we identify analytic conditions under which em exhibits newton-like behavior, and conditions under which it possesses poor, first-order convergence surrounding text:5. 
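The "surrounding text" fragment above appears to be a garbled rendering of the classical MDS step: presumably the similarity matrix is D-tilde = -1/2 H D^2 H with centering matrix H = I - (1/n)11^T, and the latent coordinates X are then chosen to approximately minimize ||D-tilde - XX^T||_F, which the leading eigenvectors of D-tilde provide. That reading, and the sketch below, are reconstructions based on standard classical MDS rather than a quotation of the source.

# Sketch of classical MDS as apparently referenced above: double-centre the squared
# dissimilarity matrix and take the top eigenvectors as latent coordinates.
import numpy as np

def classical_mds(D, p=2):
    """D: (n, n) pairwise dissimilarities; returns (n, p) coordinates X
    approximately minimizing ||D_tilde - X X^T||_F."""
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    D_tilde = -0.5 * H @ (D ** 2) @ H          # similarity (Gram) matrix
    eigvals, eigvecs = np.linalg.eigh(D_tilde)
    order = np.argsort(eigvals)[::-1][:p]
    vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(vals)

# usage: recover 2-D positions from pairwise distances
rng = np.random.default_rng(0)
X_true = rng.normal(size=(50, 2))
D = np.linalg.norm(X_true[:, None] - X_true[None, :], axis=-1)
print(classical_mds(D, p=2).shape)   # (50, 2)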
results we report experiments on synthetic data generated by a model described below and the nips co-authorship data [***]<3>, and some large subsets of citeseer. we investigate three things: ability of the algorithm to reconstruct the latent space based only on link observations, anecdotal evaluation of what happens to the nips data, and scalability influence:3 type:3 pair index:499 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:first, we review the underlying statistical assumptions of a static topic model, such as latent dirichlet allocation (lda) (blei et al. , 2003)[***]<1>. let fi1:k be k topics, each of which is a distribution over a fixed vocabulary influence:1 type:1 pair index:500 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:381 citee title:correlated topic models citee abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. 
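In the surrounding-text snippet just above, the Greek symbols appear to have been lost in extraction: "let fi1:k be k topics" presumably reads "let β1:K be K topics". For reference, the state-space chains that the dynamic topic model places on the natural parameters are usually written as below; this is a hedged reconstruction in LaTeX, with σ² and δ² the evolution variances for the topic parameters and the topic-proportion means respectively.

\begin{align}
  \beta_{t,k} \mid \beta_{t-1,k} &\sim \mathcal{N}\!\left(\beta_{t-1,k},\, \sigma^2 I\right), \\
  \alpha_{t} \mid \alpha_{t-1} &\sim \mathcal{N}\!\left(\alpha_{t-1},\, \delta^2 I\right).
\end{align}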
a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions. in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution . we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets surrounding text:1, ffi2i) . (2) for simplicity, we do not model the dynamics of topic correlation, as was done for static models by blei and lafferty (2006)[***]<1>. by chaining together topics and topic proportion distributions, we have sequentially tied a collection of topic models influence:2 type:2 pair index:501 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:426 citee title:finding scientific topics citee abstract:a first step in identifying the content of a document is determining which topics that document addresses. we describe a generative model for documents, introduced by blei, ng, and jordan , in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. we then present a markov chain monte carlo algorithm for inference in this model. we use this algorithm to analyze s from pnas by using bayesian model selection to establish the number of topics. we show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying hot topics by examining temporal dynamics and tagging abstracts to illustrate semantic content. surrounding text:we use variational methods as deterministic alternatives to stochastic simulation, in order to handle the large data sets typical of text analysis. while gibbs sampling has been effectively used for static topic models (griffiths and steyvers, 2004)[***]<1>, nonconjugacy makes sampling methods more difficult for this dynamic model. 
the idea behind variational methods is to optimize the free parameters of a distribution over the latent variables so that the distribution is close in kullback-liebler (kl) divergence to the true posterior influence:1 type:1 pair index:502 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:144 citee title:a new approach to linear filtering and prediction problems citee abstract:the classical filtering and prediction problem is re-examined using the bodeshannon representation of random processes and the state transition method of analysis of dynamic systems. new results are: (1) the formulation and methods of solution of the problem apply without modification to stationary and nonstationary statistics and to growing-memory and infinitememory filters. (2) a nonlinear difference (or differential) equation is derived for the covariance matrix of the optimal estimation error. from the solution of this equation the coefficients of the difference (or differential) equation of the optimal linear filter are obtained without further calculations. (3) the filtering problem is shown to be the dual of the noise-free regulator problem. the new method developed here is applied to two well-known problems, confirming and extending earlier results. the discussion is largely self-contained and proceeds from first principles; basic concepts of the theory of random processes are reviewed in the appendix. surrounding text:t. using standard kalman filter calculations (kalman, 1960)[***]<3>, the forward mean and variance of the variational posterior are given by mt  e(fit . fi1:t) =  influence:3 type:3 pair index:503 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:385 citee title:inference of population structure using multilocus genotype data citee abstract:we describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. we assume a model in which there are k populations (where k may be unknown), each of which is characterized by a set of allele frequencies at each locus. 
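The Kalman-filter snippet above breaks off mid-equation; the quantity being defined is presumably the forward mean m_t ≡ E(β_t | β̂_{1:t}) of the variational posterior, with each variational observation β̂_t treated as a noisy measurement of β_t with variance v̂_t. Under that random-walk state model the standard Kalman forward recursions take the form below; the notation is reconstructed and may differ from the paper's exact expressions.

\begin{align}
  m_t &= \left(\frac{\hat{v}_t}{V_{t-1} + \sigma^2 + \hat{v}_t}\right) m_{t-1}
       + \left(1 - \frac{\hat{v}_t}{V_{t-1} + \sigma^2 + \hat{v}_t}\right) \hat{\beta}_t, \\
  V_t &\equiv \mathrm{Var}\!\left(\beta_t \mid \hat{\beta}_{1:t}\right)
       = \left(\frac{\hat{v}_t}{V_{t-1} + \sigma^2 + \hat{v}_t}\right) \left(V_{t-1} + \sigma^2\right).
\end{align}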
individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. we show that the method can produce highly accurate assignments using modest numbers of locie.g., seven microsatellite loci in an example using genotype data from an endangered bird species. the software used for this article is available from http://www.stats.ox.ac.uk/zpritch/home.html. surrounding text:, 2005)[6][12]<3>, biological data (pritchard et al. , 2000)[***]<3>, and survey data (erosheva, 2002)[5]<3>. in an exchangeable topic model, the words of each docuappearing in proceedings of the 23 rd international conference on machine learning, pittsburgh, pa, 2006 influence:2 type:3 pair index:504 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:454 citee title:the author-topic model for authors and documents citee abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer s. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the authortopic model, and demonstrate applications to computing similarity between authors and entropy of author output surrounding text:buntine and jakulin, 2004. blei and lafferty, 2006)[2][3][4][7][9][***]<2>. 
these models are called topic models because the discovered patterns often reflect the underlying topics which combined to form the documents influence:2 type:1 pair index:505 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:436 citee title:discovering objects and their location in images citee abstract:given a set of images containing multiple object categories, we seek to discover those categories and their image locations without supervision. we achieve this using generative models from the statistical text literature: probabilistic latent semantic analysis (plsa), and latent dirichlet allocation (lda). in text analysis these are used to discover topics in a corpus using the bag-of-words document representation. here we discover topics as object categories, so that an image containing instances of several categories is modelled as a mixture of topics. the models are applied to images by using a visual analogue of a word, formed by vector quantizing sift like region descriptors. we investigate a set of increasingly demanding scenarios, starting with image sets containing only two object categories through to sets containing multiple categories (including airplanes, cars, faces, motorbikes, spotted cats) and background clutter. the object categories sample both intra-class and scale variation, and both the categories and their approximate spatial layout are found without supervision. we also demonstrate classification of unseen images and images containing multiple objects. performance of the proposed unsupervised method is compared to the semisupervised approach of surrounding text:sivic et al. , 2005)[6][***]<3>, biological data (pritchard et al. , 2000)[10]<3>, and survey data (erosheva, 2002)[5]<3> influence:2 type:3 pair index:506 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. 
the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:455 citee title:sparse gaussian processes using pseudo-inputs citee abstract:: we present a new gaussian process (gp) regression model whose covarianceis parameterized by the the locations of m pseudo-input points,which we learn by a gradient based optimizationfi we take mn ,where n is the number of real data points, and hence obtain a sparseregression method which hasn) training cost and) predictioncost per test casefi we also find hyperparameters of the covariancefunction in the same joint optimizationfi the method can be viewedas a bayesian surrounding text:(a similar technique for gaussian processes is described in snelson and ghahramani, 2006. )[***]<1> the variational distribution of the document-level latent a a a . influence:3 type:1 pair index:507 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:205 citee title:all of nonparametric statistics citee abstract:this text provides the reader with a single book where they can find accounts of a number of up-to-date issues in nonparametric inference. the book is aimed at masters or phd level students in statistics, computer science, and engineering. it is also suitable for researchers who want to get up to speed quickly on modern nonparametric methods. it covers a wide range of topics including the bootstrap, the nonparametric delta method, nonparametric regression, density estimation, orthogonal function methods, minimax estimation, nonparametric confidence sets, and wavelets. the books dual approach includes a mixture of methodology and theory. surrounding text:variational wavelet regression the variational kalman filter can be replaced with variational wavelet regression. for a readable introduction standard wavelet methods, see wasserman (2006)[***]<3>. we rescale time so it is between 0 and 1 influence:2 type:1 pair index:508 citer id:452 citer title:dynamic topic models citer abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. 
the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 citee id:276 citee title:bayesian forecasting and dynamic models citee abstract:the second edition of this book includes revised, updated, and additional material on the structure, theory, and application of classes of dynamic models in bayesian time series analysis and forecasting. in addition to wide ranging updates to central material, the second edition includes many more exercises and covers new topics at the research and application frontiers of bayesian forecasting. surrounding text:each topic's natural parameters βt,k evolve over time, together with the mean parameters αt of the logistic normal distribution for the topic proportions. tion (aitchison, 1982)[1]<1> to time-series simplex data (west and harrison, 1997)[***]<1>. in lda, the document-specific topic proportions θ are drawn from a dirichlet distribution influence:2 type:1 pair index:509 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:622 citee title:how to search a social network citee abstract:we address the question of how participants in a small world experiment are able to find short paths in a social network using only local information about their immediate contacts. we simulate such experiments on a network of actual email contacts within an organization as well as on a student social networking website. on the email network we find that small world search strategies using a contact's position in physical space or in an organizational hierarchy relative to the target can effectively be used to locate most individuals. however, we find that in the on-line student network, where the data is incomplete and hierarchical structures are not well defined, local search strategies are less effective. we compare our findings to recent theoretical hypotheses about underlying social structure that would enable these simple search strategies to succeed and discuss the implications to social software design. surrounding text:this finding serves as an example that network properties are not sufficient to optimize flow on an email network. adamic and adar (2004)[***]<2> studied the efficiency of local information search strategies on social networks.
they find that in the case of an email network at hp labs, a greedy search strategy works efficiently as predicted by kleinberg (2000)[6]<2> and watts et al influence:2 type:3 pair index:510 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:latent dirichlet allocation (blei et al., 2003)[***]<1> robustly discovers multinomial word distributions of these topics. hierarchical dirichlet processes (teh et al. latent dirichlet allocation (lda) is a bayesian network that generates a document using a mixture of topics (blei et al., 2003)[***]<1>. in its generative process, for each document d, a multinomial distribution θ over topics is randomly ... the clustering may be either external to the model, by simple greedy-agglomerative clustering, or internal to the model, by introducing latent variables for the senders' and recipients' roles, as described in the role-author-recipient-topic (rart) model toward the end of this paper. three standard approximations have been used to obtain practical results: variational methods (blei et al., 2003)[***]<1>, gibbs sampling (griffiths & steyvers, 2004; steyvers et al. as discussed above, the art model is a direct offspring of latent dirichlet allocation (blei et al., 2003)[***]<1>, the multi-label mixture model (mccallum, 1999)[8]<1>, and the author-topic model (steyvers et al., 2004 influence:1 type:1 pair index:511 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links.
we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:767 citee title:navigation in a small world citee abstract:the small-world phenomenon, the principle that most of us are linked by short chains of acquaintances, was first investigated as a question in sociology [1, 2] and is a feature of a range of networks arising in nature and technology [3, 4, 5]. experimental study of the phenomenon [1] revealed that it has two fundamental components: first, such short chains are ubiquitous, and second, individuals operating with purely local information are very adept at finding these chains. the first issue has been analysed [2, 3, 4], and here i investigate the second by modelling how individuals can find short chains in a large social network. surrounding text:adamic and adar (2004)[1]<2> studied the efficiency of local information search strategies on social networks. they find that in the case of an email network at hp labs, a greedy search strategy works efficiently as predicted by kleinberg (2000)[***]<2> and watts et al. (2002)[16]<2> influence:2 type:1 pair index:512 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:853 citee title:the structural equivalence of individuals in social networks citee abstract:there are two problems in structural sociology that have attracted the attention of researchers: first, the creation of individual identity by means of the social networks in which individuals are embedded; second, that of structural equivalence, i.e. the problem of non-identification of individuals by means of those social networks. the aim of this article is to bring both problems together under a common framework from which both identity and equivalence of individuals in social networks will simultaneously emerge. or in other words, the study attempts to show that structural determinations of identity and equivalence are just the same.
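the two search-related abstracts above (kleinberg's navigation result and the local-information search study) both concern greedy routing: each node forwards a message to whichever neighbour looks closest to the target, using only local information. a small python sketch of such a greedy strategy on a toy graph with node coordinates follows; the graph, the distance function, and the stopping rule are illustrative assumptions rather than either cited model.

    import math

    # toy graph: nodes are 2-d points; each node knows only its neighbours' positions
    positions = {0: (0, 0), 1: (1, 0), 2: (1, 1), 3: (2, 1), 4: (3, 1)}
    neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

    def dist(a, b):
        (x1, y1), (x2, y2) = positions[a], positions[b]
        return math.hypot(x1 - x2, y1 - y2)

    def greedy_route(source, target, max_steps=100):
        """forward the message to the neighbour closest to the target (local information only)."""
        path, current = [source], source
        for _ in range(max_steps):
            if current == target:
                return path
            nxt = min(neighbours[current], key=lambda n: dist(n, target))
            if dist(nxt, target) >= dist(current, target):
                return path + [nxt]   # stuck in a local minimum; give up
            current = nxt
            path.append(current)
        return path

    print(greedy_route(0, 4))  # [0, 1, 2, 3, 4]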
in order to achieve this, it is necessary to go beyond the duality approach insofar as duality represents social structure either as a network of interconnected social circles or as a network of individuals who are linked by one or some given relations. accordingly, the concept of place and place analysis is used. surrounding text:p(z_i | z_-i, x, w) ∝ (n_{z_i}^{w_i} + β_{w_i}) / Σ_v (n_{z_i}^{v} + β_v) · (n_{x_i}^{z_i} + α_{z_i}) / Σ_{z'} (n_{x_i}^{z'} + α_{z'}), and p(x_i | z, x_-i, w) ∝ (n_{x_i}^{z_i} + α_{z_i}) / Σ_{z'} (n_{x_i}^{z'} + α_{z'}). 3 related work: the use of social networks to discover roles for the people (or nodes) in the network goes back over three decades to the work of lorrain and white (1971)[***]<1>. it is based on the hypothesis that nodes on a network that relate to other nodes in equivalent ways must have the same role influence:2 type:3 pair index:513 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:759 citee title:multi-label text classification with a mixture model trained by em citee abstract:in many important document classification tasks, documents may each be associated with multiple class labels. this paper describes a bayesian classification approach in which the multiple classes that comprise a document are represented by a mixture model. while the labeled training data indicates which classes were responsible for generating a document, it does not indicate which class was responsible for generating each word. thus we use em to fill in this missing value, learning both the surrounding text:the robustness of the model is greatly enhanced by integrating out uncertainty about the per-document topic distribution θ. the author model (also termed a multi-label mixture model) (mccallum, 1999)[***]<2>, is a bayesian network that simultaneously models document content and its authors' interests with a one-to-one correspondence between topics and authors. the model was originally applied to multi-label document classification, with categories acting as authors influence:2 type:1 pair index:514 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities.
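the gibbs-sampling conditionals quoted (and reconstructed) in the surrounding-text snippet above resample, for each word token, a topic and then a recipient from count-ratio posteriors. the following python sketch shows one such update in that spirit; the count arrays, hyperparameter values, and the exact sampling order are assumptions for illustration, not the art paper's code.

    import numpy as np

    rng = np.random.default_rng(0)
    T, R, V = 4, 3, 50            # topics, recipients, vocabulary size (illustrative)
    alpha, beta = 0.1, 0.01       # symmetric dirichlet hyperparameters (assumed)

    # n_tw[t, w]: tokens of word type w assigned to topic t; n_rt[r, t]: tokens with recipient r and topic t
    n_tw = rng.integers(1, 5, size=(T, V)).astype(float)
    n_rt = rng.integers(1, 5, size=(R, T)).astype(float)

    def resample_token(w, z_old, x_old):
        """one gibbs step for a single token with word id w, current topic z_old, recipient x_old."""
        # remove the token's current assignment from the counts
        n_tw[z_old, w] -= 1
        n_rt[x_old, z_old] -= 1
        # p(z | ...) proportional to (n_tw + beta)/(row sum) * (n_rt + alpha)/(row sum)
        p_z = (n_tw[:, w] + beta) / (n_tw.sum(axis=1) + V * beta) * \
              (n_rt[x_old, :] + alpha) / (n_rt[x_old, :].sum() + T * alpha)
        z_new = rng.choice(T, p=p_z / p_z.sum())
        # p(x | ...) proportional to (n_rt + alpha)/(row sum), evaluated at the new topic
        p_x = (n_rt[:, z_new] + alpha) / (n_rt.sum(axis=1) + T * alpha)
        x_new = rng.choice(R, p=p_x / p_x.sum())
        # add the token back with its new assignment
        n_tw[z_new, w] += 1
        n_rt[x_new, z_new] += 1
        return z_new, x_new

    print(resample_token(w=7, z_old=2, x_old=1))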
the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:533 citee title:expectation-propagation for the generative aspect model citee abstract:the generative aspect model is an extension of the multinomial model for text that allows word probabilities to vary stochastically across documents surrounding text:, 2004)[4][10][12]<1>, and expectation propagation (griffiths & steyvers, 2004; minka & lafferty, 2002)[4][***]<1>. we chose gibbs sampling for its ease of implementation influence:3 type:1 pair index:515 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:825 citee title:probabilistic author-topic models for information discovery citee abstract:we propose a new unsupervised learning technique for extracting information from large text collections. we model documents as if they were generated by a two-stage stochastic process. each author is represented by a probability distribution over topics, and each topic is represented as a probability distribution over words for that topic. the words in a multi-author paper are assumed to be the result of a mixture of each author's topic mixture. the topic-word and author-topic distributions are learned from data in an unsupervised manner using a markov chain monte carlo algorithm. we apply the methodology to a large corpus of 160,000 abstracts and 85,000 authors from the well-known citeseer digital library, and learn a model with 300 topics. we discuss in detail the interpretation of the results discovered by the system including specific topic and author models, ranking of authors by topic and topics by author, significant trends in the computer science literature between 1990 and 2002, parsing of abstracts by topics and authors, and detection of unusual papers by specific authors. an online query interface to the model is also discussed that allows interactive exploration of author-topic models for corpora such as citeseer. surrounding text:rosen-zvi et al., 2004)[10][***]<1> learns topics conditioned on the mixture of authors that composed a document.
however, none of these models are appropriate for social network analysis, in which we aim to capture the directed interactions and relationships between people. rosen-zvi et al., 2004)[4][10][***]<1>, and expectation propagation (griffiths & steyvers, 2004; minka & lafferty, 2002)[4][9]<1>. rosen-zvi et al., 2004)[10][***]<1>, with the distinction that art is specifically designed to capture language used in a directed network of correspondents. 4 experimental results: we present results with the enron email corpus and the personal email of one of the authors of this paper (mccallum) influence:1 type:1 pair index:516 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:437 citee title:hierarchical dirichlet processes citee abstract:we consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. we assume that the number of mixture components is unknown a priori and is to be inferred from the data. in this setting it is natural to consider sets of dirichlet processes, one for each group, where the well-known clustering property of the dirichlet process provides a nonparametric prior for the number of mixture components within each group. given our desire to tie the mixture models in the various groups, we consider a hierarchical model, specifically one in which the base measure for the child dirichlet processes is itself distributed according to a dirichlet process. such a base measure being discrete, the child dirichlet processes necessarily share atoms. thus, as desired, the mixture models in the different groups necessarily share mixture components. we discuss representations of hierarchical dirichlet processes in terms of a stick-breaking process, and a generalization of the chinese restaurant process that we refer to as the chinese restaurant franchise. we present markov chain monte carlo algorithms for posterior inference in hierarchical dirichlet process mixtures, and describe applications to problems in information retrieval and text modelling surrounding text:hierarchical dirichlet processes (teh et al., 2004)[***]<1> can determine an appropriate number of topics for a corpus.
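the hierarchical dirichlet process record above is cited for letting the number of topics be inferred rather than fixed, via stick-breaking and chinese-restaurant-franchise constructions. as a small flavour of the stick-breaking step only, here is a python sketch; the concentration value and the finite truncation are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    gamma = 1.0        # concentration parameter (assumed)
    truncation = 20    # finite truncation for illustration only

    # gem / stick-breaking: weight_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ beta(1, gamma)
    v = rng.beta(1.0, gamma, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    weights = v * remaining

    print(weights[:5], weights.sum())  # weights decay; the sum approaches 1 as truncation grows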
the author-topic model (steyvers et al influence:3 type:1 pair index:517 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:623 citee title:identity and search in social networks citee abstract:social networks have the surprising property of being "searchable": ordinary people are capable of directing messages through their network of acquaintances to reach a specific but distant target person in only a few steps. we present a model that offers an explanation of social network searchability in terms of recognizable personal identities defined along a number of social dimensions. our model defines a class of searchable networks and a method for searching them that may be applicable to many network search problems including the location of data files in peer-to-peer networks, pages on the world wide web, and information in distributed databases. surrounding text:they find that in the case of an email network at hp labs, a greedy search strategy works efficiently as predicted by kleinberg (2000)[6]<2> and watts et al. (2002)[***]<2>. all these approaches, however, limit themselves to the use of network topology to discover roles influence:2 type:3 pair index:518 citer id:453 citer title:the author-recipient-topic model for topic and role discovery in social networks: experiments with enron and academic email citer abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation and the author-topic (at) model, adding the key attribute that the distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give results on both the enron email corpus and a researcher's email archive, providing evidence not only that clearly relevant topics are discovered, but that the art model better predicts people's roles citee id:657 citee title:information flow in social groups citee abstract:we present a study of information flow that takes into account the observation that an item relevant to one person is more likely to be of interest to individuals in the same social circle than those outside of it. this is due to the fact that the similarity of node attributes in social networks decreases as a function of the graph distance.
an epidemic model on a scale-free network with this property has a finite threshold, implying that the spread of information is limited. we tested our predictions by measuring the spread of messages in an organization and also by numerical experiments that take into consideration the organizational distance among individuals surrounding text:with the recent availability of large datasets of human interactions (shetty & adibi, 2004; wu et al., 2003)[11][***]<3>, the popularity of services like friendster.com and linkedin influence:2 type:3 pair index:519 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003)[***]<1> to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. , blei, ng, & jordan, 2003; hofmann, 1999)[***]<1>[5]<1>. here, we consider how these approaches can be used to address another fundamental problem raised by large document collections: modeling the interests of authors. in this paper we describe a generative model for document collections, the author-topic model, that simultaneously models the content of documents and the interests of authors.
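the author-topic abstract repeated above states its generative assumptions directly: each author has a multinomial over topics, each topic a multinomial over words, and a multi-author document mixes its authors' topic distributions. a minimal python sketch of generating one document under those assumptions follows; the dimensions and dirichlet parameters are assumed for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    A, T, V = 3, 4, 30                          # authors, topics, vocabulary size (illustrative)
    theta = rng.dirichlet(np.full(T, 0.5), A)   # per-author topic distributions
    phi = rng.dirichlet(np.full(V, 0.1), T)     # per-topic word distributions

    def generate_document(author_ids, n_words=20):
        words = []
        for _ in range(n_words):
            a = rng.choice(author_ids)                # pick one of the document's authors uniformly
            z = rng.choice(T, p=theta[a])             # topic from that author's topic distribution
            words.append(rng.choice(V, p=phi[z]))     # word from the topic's word distribution
        return words

    print(generate_document([0, 2]))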
this generative model represents each document with a mixture of topics, as in state-of-the-art approaches like latent dirichlet allocation (blei et al., 2003)[***]<1>, and extends these approaches to author modeling by allowing the mixture weights for different topics to be determined by the authors of the document. by learning the parameters of the model, we obtain the set of topics that appear in a corpus and their relevance to different documents, as well as identifying which topics are used by which authors. 2 generative models for documents: we will describe three generative models for documents: one that models documents as a mixture of topics (blei et al., 2003)[***]<1>, one that models authors with distributions over words, and one that models both authors and documents using topics. all three models use the same notation. , 2003; hofmann, 1999)[***]<1>[5]<1>. we will describe one such model, latent dirichlet allocation (lda; blei et al., 2003)[***]<1>. in lda, the generation of a document collection is modeled as a three-step process. a variety of algorithms have been used (footnote: the model we describe is actually the smoothed lda model (blei et al., 2003)[***]<1> with symmetric dirichlet priors (griffiths & steyvers, 2004)[4]<1>, as this is closest to the author-topic model.) to estimate these parameters, including variational inference (blei et al. to estimate these parameters, including variational inference (blei et al., 2003)[***]<1>, expectation propagation (minka & lafferty, 2002)[9]<1>, and gibbs sampling (griffiths & steyvers, 2004)[4]<1>. however, this topic model provides no explicit information about the interests of authors: while it is informative about the content of documents, authors may produce several documents (often with co-authors) and it is consequently unclear how the topics used in these documents might be used to describe the interests of the authors. blei et al., 2003)[***]<1>, a topic model. (b) an author model. hofmann, 1999)[5]<1>, to approximate inference methods like variational em (blei et al., 2003)[***]<1>, expectation propagation (minka & lafferty, 2002)[9]<1>, and gibbs sampling (griffiths & steyvers, 2004)[4]<1>. generic em algorithms tend to face problems with local maxima in these models (blei et al. generic em algorithms tend to face problems with local maxima in these models (blei et al., 2003)[***]<1>, suggesting a move to approximate methods in which some of the parameters, such as θ and φ, can be integrated out rather than explicitly estimated. in this paper, we will use gibbs sampling, as it provides a simple method for obtaining parameter estimates under dirichlet priors and allows combination of estimates from several local maxima of the posterior distribution influence:1 type:1 pair index:520 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts.
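the snippet above summarises lda's generation of a document collection as a three-step process (draw topic proportions for the document, draw a topic for each word position, draw each word from its topic). a compact python sketch of those steps for a single document is given below; the dimensions and hyperparameter values are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    T, V, alpha, eta = 4, 30, 0.5, 0.1               # topics, vocabulary, hyperparameters (assumed)
    phi = rng.dirichlet(np.full(V, eta), T)          # per-topic word distributions (smoothed lda prior)

    def generate_lda_document(n_words=25):
        theta = rng.dirichlet(np.full(T, alpha))     # step 1: topic proportions for this document
        z = rng.choice(T, size=n_words, p=theta)     # step 2: a topic for each word position
        return [rng.choice(V, p=phi[k]) for k in z]  # step 3: a word drawn from the chosen topic

    print(generate_lda_document())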
exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:855 citee title:the missing link: a probabilistic model of document content and hypertext connectivity citee abstract:we describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. the model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. furthermore, the relationships between topics are mapped out in order to build a predictive model of link content. among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis. surrounding text:cf. cohn & hofmann, 2001)[***]<3>, combining topic models with stylometry models for author identification, and applications such as automated reviewer list generation given sets of documents for review. acknowledgements the research in this paper was supported in part by the national science foundation under grant iri-9703120 via the knowledge discovery and dissemination (kdd) program influence:2 type:2 pair index:521 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:660 citee title:markov chain monte carlo in practice citee abstract:markov chain monte carlo (mcmc) methods make possible the use of flexible bayesian models that would otherwise be computationally infeasible. in recent years, a great variety of such applications have been described in the literature. applied statisticians who are new to these methods may have several questions and concerns, however: how much effort and expertise are needed to design and use a markov chain sampler? how much confidence can one have in the answers that mcmc produces?
how does the use of mcmc affect the rest of the model-building process? at the joint statistical meetings in august, 1996, a panel of experienced mcmc users discussed these and other issues, as well as various tricks of the trade. this article is an edited recreation of that discussion. its purpose is to offer advice and guidance to novice users of mcmc and to not-so-novice users as well. topics include building confidence in simulation results, methods for speeding and assessing convergence, estimating standard errors, identification of models for which good mcmc algorithms exist, and the current state of software development. surrounding text:the lda model has two sets of unknown parameters (the d document distributions θ, and the t topic distributions φ), as well as the latent variables corresponding to the assignments of individual words to topics z. by applying gibbs sampling (see gilks, richardson, & spiegelhalter, 1996)[***]<1>, we construct a markov chain that converges to the posterior distribution on z and then use the results to infer θ and φ (griffiths & steyvers, 2004)[4]<1>. the transition between successive states of the markov chain results from repeatedly drawing z from its distribution conditioned on all other variables, summing out θ and φ using standard dirichlet integrals: p(z_i = j | w_i = m, ... influence:2 type:1 pair index:522 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:426 citee title:finding scientific topics citee abstract:a first step in identifying the content of a document is determining which topics that document addresses. we describe a generative model for documents, introduced by blei, ng, and jordan, in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. we then present a markov chain monte carlo algorithm for inference in this model. we use this algorithm to analyze abstracts from pnas by using bayesian model selection to establish the number of topics. we show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying hot topics by examining temporal dynamics and tagging abstracts to illustrate semantic content.
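the surrounding-text snippet above describes collapsed gibbs sampling for lda: θ and φ are summed out and each topic assignment z_i is redrawn conditioned on all the others via count ratios. a schematic python sweep in that spirit follows; the toy corpus, count arrays, and hyperparameters are illustrative assumptions, not the cited implementation.

    import numpy as np

    rng = np.random.default_rng(0)
    T, V, alpha, beta = 4, 50, 0.5, 0.01             # topics, vocabulary, hyperparameters (assumed)
    docs = [rng.integers(0, V, size=30).tolist() for _ in range(5)]   # toy corpus of word ids

    z = [[int(rng.integers(0, T)) for _ in doc] for doc in docs]      # random initial topic assignments
    n_dt = np.zeros((len(docs), T))                                   # document-topic counts
    n_tw = np.zeros((T, V))                                           # topic-word counts
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            n_dt[d, z[d][i]] += 1
            n_tw[z[d][i], w] += 1

    def gibbs_sweep():
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t_old = z[d][i]
                n_dt[d, t_old] -= 1                                   # remove the current assignment
                n_tw[t_old, w] -= 1
                # p(z_i = t | rest) proportional to (n_tw + beta)/(row sum + V*beta) * (n_dt + alpha)
                p = (n_tw[:, w] + beta) / (n_tw.sum(axis=1) + V * beta) * (n_dt[d] + alpha)
                t_new = rng.choice(T, p=p / p.sum())
                z[d][i] = t_new
                n_dt[d, t_new] += 1                                   # add it back under the new topic
                n_tw[t_new, w] += 1

    for _ in range(10):
        gibbs_sweep()
    print(n_dt)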
surrounding text:a variety of algorithms have been used (footnote: the model we describe is actually the smoothed lda model (blei et al., 2003)[1]<1> with symmetric dirichlet priors (griffiths & steyvers, 2004)[***]<1>, as this is closest to the author-topic model.) to estimate these parameters, including variational inference (blei et al. to estimate these parameters, including variational inference (blei et al., 2003)[1]<1>, expectation propagation (minka & lafferty, 2002)[9]<1>, and gibbs sampling (griffiths & steyvers, 2004)[***]<1>. however, this topic model provides no explicit information about the interests of authors: while it is informative about the content of documents, authors may produce several documents (often with co-authors) and it is consequently unclear how the topics used in these documents might be used to describe the interests of the authors. hofmann, 1999)[5]<1>, to approximate inference methods like variational em (blei et al., 2003)[1]<1>, expectation propagation (minka & lafferty, 2002)[9]<1>, and gibbs sampling (griffiths & steyvers, 2004)[***]<1>. generic em algorithms tend to face problems with local maxima in these models (blei et al. the lda model has two sets of unknown parameters (the d document distributions θ, and the t topic distributions φ), as well as the latent variables corresponding to the assignments of individual words to topics z. by applying gibbs sampling (see gilks, richardson, & spiegelhalter, 1996)[3]<1>, we construct a markov chain that converges to the posterior distribution on z and then use the results to infer θ and φ (griffiths & steyvers, 2004)[***]<1>. the transition between successive states of the markov chain results from repeatedly drawing z from its distribution conditioned on all other variables, summing out θ and φ using standard dirichlet integrals: p(z_i = j | w_i = m, ... influence:1 type:1 pair index:523 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:427 citee title:probabilistic latent semantic indexing citee abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words.
in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous surrounding text:, blei, ng, & jordan, 2003; hofmann, 1999)[1]<1>[***]<1>. here, we consider how these approaches can be used to address another fundamental problem raised by large document collections: modeling the interests of authors. , 2003; hofmann, 1999)[1]<1>[***]<1>. we will describe one such model, latent dirichlet allocation (lda. 3 gibbs sampling algorithms: a variety of algorithms have been used to estimate the parameters of topic models, from basic expectation-maximization (em; hofmann, 1999)[***]<1>, to approximate inference methods like variational em (blei et al., 2003)[1]<1>, expectation propagation (minka & lafferty, 2002)[9]<1>, and gibbs sampling (griffiths & steyvers, 2004)[4]<1> influence:1 type:1 pair index:524 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:856 citee title:the federalist revisited: new directions in authorship attribution citee abstract:the federalist papers, twelve of which are claimed by both alexander hamilton and james madison, have long been used as a testing-ground for authorship attribution techniques despite the fact that the styles of hamilton and madison are unusually similar. this paper assesses the value of three novel stylometric techniques by applying them to the federalist problem. the techniques examined are a multivariate approach to vocabulary richness, analysis of the frequencies of occurrence of sets of common high-frequency words, and use of a machine-learning package based on a genetic algorithm to seek relational expressions characterizing authorial styles. all three approaches produce encouraging results to what is acknowledged to be a difficult problem. surrounding text:(e.g., holmes & forsyth, 1995)[***]<2> finds stylistic features (e.g.
influence:3 type:2 pair index:525 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:416 citee title:digital libraries and autonomous citation indexing citee abstract:the world wide web is revolutionizing the way that researchers access scientific information. articles are increasingly being made available on the homepages of authors or institutions, at journal web sites, or in online archives. however, scientific information on the web is largely disorganized. this article introduces the creation of digital libraries incorporating autonomous citation indexing (aci). aci autonomously creates citation indices similar to the science citation index®. an aci surrounding text:we do this 10 times so that 10 samples are collected in this manner (the markov chain is started 10 times from random initial assignments). 4 experimental results: in our results we used two text data sets consisting of technical papers: full papers from the nips conference and abstracts from citeseer (lawrence, giles, & bollacker, 1999)[***]<3>. we removed extremely common words from each corpus, a standard procedure in "bag of words" models influence:3 type:3 pair index:526 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts.
we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:759 citee title:multi-label text classification with a mixture model trained by em citee abstract:in many important document classification tasks, documents may each be associated with multiple class labels. this paper describes a bayesian classification approach in which the multiple classes that comprise a document are represented by a mixture model. while the labeled training data indicates which classes were responsible for generating a document, it does not indicate which class was responsible for generating each word. thus we use em to fill in this missing value, learning both the surrounding text:for each word in the document an author is chosen uniformly at random, and a word is chosen from a probability distribution over words that is specific to that author. this model is similar to a mixture model proposed by mccallum (1999)[***]<2> and is equivalent to a variant of lda in which the mixture weights for the different topics are fixed. the underlying graphical model is shown in figure 1(b) influence:1 type:2 pair index:527 citer id:454 citer title:the author-topic model for authors and documents citer abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output citee id:533 citee title:expectation-propagation for the generative aspect model citee abstract:the generative aspect model is an extension of the multinomial model for text that allows word probabilities to vary stochastically across documents surrounding text:to estimate these parameters, including variational inference (blei et al., 2003)[1]<1>, expectation propagation (minka & lafferty, 2002)[***]<1>, and gibbs sampling (griffiths & steyvers, 2004)[4]<1>. however, this topic model provides no explicit information about the interests of authors: while it is informative about the content of documents, authors may produce several documents (often with co-authors) and it is consequently unclear how the topics used in these documents might be used to describe the interests of the authors. hofmann, 1999)[5]<1>, to approximate inference methods like variational em (blei et al., 2003)[1]<1>, expectation propagation (minka & lafferty, 2002)[***]<1>, and gibbs sampling (griffiths & steyvers, 2004)[4]<1>.
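the snippet above describes the simple author model: for each word, one of the document's authors is chosen uniformly at random, and the word is drawn from that author's word distribution. a one-function python sketch under assumed dimensions follows.

    import numpy as np

    rng = np.random.default_rng(0)
    A, V = 3, 30                                       # authors, vocabulary size (illustrative)
    author_word = rng.dirichlet(np.full(V, 0.1), A)    # one word distribution per author

    def generate_author_model_document(author_ids, n_words=15):
        words = []
        for _ in range(n_words):
            a = rng.choice(author_ids)                     # author chosen uniformly at random
            words.append(rng.choice(V, p=author_word[a]))  # word from that author's distribution
        return words

    print(generate_author_model_document([0, 1]))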
generic em algorithms tend to face problems with local maxima in these models (blei et al influence:3 type:1 pair index:528 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries have already claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet, extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying today's web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases, paving the road towards meeting even the real-time challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows one to get an early impression of the skyline for subsequent query refinement citee id:176 citee title:a situation-aware mobile traffic information prototype citee abstract:mobile information services will play an important role in our future work and private life. enabling mobility in urban and populous areas needs innovative tools and novel techniques for individual traffic planning. however, though there already are navigation systems featuring route planning, their usability is often difficult, because neither current information nor personal user preferences are incorporated. we present a prototype of a traffic information system offering advanced personalized route planning, including added services like traffic jam alerting by means of sms. the fast-changing nature of such data requires gathering it from on-line internet sources. given current bandwidth limitations, an asynchronous update strategy of a central service database prepares the ground to meet real-time requirements while providing up-to-date information. the top results of a personalized route planning query can be efficiently computed by our srcombine algorithm in less than 3 seconds, as practical case studies show. xslt technology automatically converts these results for the delivery to various mobile devices. in summary, we give evidence that by intelligently joining the latest technologies for preference modeling and preference query evaluation, advanced personalized mobile traffic information systems are feasible today. surrounding text:consider for instance web information services accessible via mobile devices. first useful services like city guides, route planning, or restaurant booking have been developed [5]<2>, [***]<2>, and generally all these services will heavily rely on information distributed over several internet sources possibly provided by independent content providers. frameworks like ntt docomo's i-mode [18]<3> already provide a common platform and business model for a variety of independent content providers. e.g. given by [3]<2>, [5]<2> or [***]<2>. however, all these top-k retrieval systems relied on a single combining function (often called utility function) that is used to compensate scores between different parts of the query. if an object is dominated by another object, remove it from k_i and p. 5.
output p as the set of all non-dominated objects. for ease of understanding, we show how the algorithm works for our running example: for mobile route planning in [***]<2> we have shown for the case of top-k retrieval how traffic information aspects can be queried from various on-line sources. posing a query on the best route with respect to, say, its length (s1) and the traffic density (s2), our user employs functions that evaluate the different aspects, but is not sure how to compensate length and density. thus it is crucial for intuitive querying in the growing number of internet-based applications. distributed web information services like [5]<2> or [***]<2> are premium examples benefiting from our contributions. in contrast to traditional skylining, we presented a first algorithm that allows us to retrieve the skyline over distributed data sources with basic middleware access techniques and have proven that it features an optimal complexity in terms of object accesses influence:3 type:3 pair index:529 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries have already claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet, extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying today's web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases, paving the road towards meeting even the real-time challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows one to get an early impression of the skyline for subsequent query refinement citee id:463 citee title:the maximin fitness function: multi-objective city and regional planning citee abstract:the maximin fitness function can be used in multi-objective genetic algorithms to obtain a diverse set of non-dominated designs. the maximin fitness function is derived from the definition of dominance, and its properties are explored. the modified maximin fitness function is proposed. both fitness functions are briefly compared to a state-of-the-art fitness function from the literature. results from a real-world multi-objective problem are presented. this problem addresses land-use and transportation planning for high-growth cities and metropolitan regions. surrounding text:e.g. given by [***]<2>, [5]<2> or [2]<2>. however, all these top-k retrieval systems relied on a single combining function (often called utility function) that is used to compensate scores between different parts of the query influence:3 type:3 pair index:530 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries have already claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources.
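the algorithm fragment and route-planning example above both come down to keeping exactly the non-dominated objects (the skyline). a small python dominance-filter sketch over score tuples is shown below; the toy data and the convention that larger scores are better are illustrative assumptions, not the paper's distributed algorithm.

    def dominates(a, b):
        """a dominates b if a is at least as good in every dimension and strictly better in one (larger is better)."""
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    def skyline(objects):
        """return all objects not dominated by any other object."""
        return [o for o in objects if not any(dominates(p, o) for p in objects if p is not o)]

    # toy routes scored by (1 - normalised length, 1 - normalised traffic density)
    routes = {"r1": (0.9, 0.2), "r2": (0.6, 0.6), "r3": (0.3, 0.9), "r4": (0.5, 0.5)}
    best = skyline(list(routes.values()))
    print([name for name, score in routes.items() if score in best])  # r4 is dominated by r2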
but due to the amount, variety and volatile nature of information accessible over the internet, extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying today's web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases, paving the road towards meeting even the real-time challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows one to get an early impression of the skyline for subsequent query refinement citee id:248 citee title:evaluating top-k queries over web-accessible databases citee abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top-k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. surrounding text:consider for instance web information services accessible via mobile devices. first useful services like city guides, route planning, or restaurant booking have been developed [***]<2>, [2]<2>, and generally all these services will heavily rely on information distributed over several internet sources possibly provided by independent content providers. frameworks like ntt docomo's i-mode [18]<3> already provide a common platform and business model for a variety of independent content providers. e.g. given by [3]<2>, [***]<2> or [2]<2>. however, all these top-k retrieval systems relied on a single combining function (often called utility function) that is used to compensate scores between different parts of the query. please note that our heuristic 1 will not affect the abstract order of complexity from our previously stated optimality results, because the maximum improvement factor over the round-robin strategy can only be the number of lists (n). but, given the rather expensive costs of object accesses over the internet, even small numbers of accesses saved will improve the overall run-time behavior, as shown in [***]<2> or [1]<2>.
thus, also improvements taking only constant factors off the algorithms complexity should be employed towards meeting real-time constraints. thus it is crucial for intuitive querying in the growing number of internet-based applications. distributed web information services like [***]<2> or [2]<2> are premium examples benefiting from our contributions. in contrast to traditional skylining, we presented a first algorithm that allows to retrieve the skyline over distributed data sources with basic middleware access techniques and have proven that it features an optimal complexity in terms of object accesses influence:2 type:2 pair index:531 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:464 citee title:querying with intrinsic preferences citee abstract:the handling of user preferences is becoming an increasingly important issue in present-day information systems. among others, preferences are used for information filtering and extraction to reduce the volume of data presented to the user. they are also used to keep track of user profiles and formulate policies to improve and automate decision making. we propose a logical framework for formulating preferences and its embedding into relational query languages. the framework is simple, and entirely neutral with respect to the properties of preferences. it makes it possible to formulate different kinds of preferences and to use preferences in querying databases. we demonstrate the usefulness of the framework through numerous examples. surrounding text:recent research on web-based information systems has focused on employing middleware algorithms, where users had to specify weightings for each aspect of their query and a central compensation function was used to find the best matching objects [7]<2>, [1]<2>. the lack of expressiveness of this top k query model, however, has first been addressed by [8]<2> and with the growing incorporation of user preferences into database systems [***]<3>, [10]<3> and information services [22]<3> the limitations of the entire model became more and more obvious. 
this led towards the integration of so-called skyline queries (e influence:2 type:3 pair index:532 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes. for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm (fagin's algorithm, or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm (the threshold algorithm, or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. © 2003 elsevier science (usa). all rights reserved. surrounding text:frameworks like ntt docomo's i-mode [18]<3> already provide a common platform and business model for a variety of independent content providers.
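The threshold algorithm (TA) summarized in the abstract above recurs throughout this section, so a compact sketch may help. This is our own simplified rendering under the usual assumptions (one descending-sorted list per attribute, random access available, a monotone aggregation function), not the cited authors' code.

```python
import heapq

def threshold_algorithm(lists, grades, agg, k):
    """Hedged sketch of TA.
    lists  : one list per attribute of (object_id, grade) pairs, sorted by grade descending
    grades : object_id -> full grade vector (stands in for random access)
    agg    : monotone aggregation function over a grade vector, e.g. sum or min
    k      : number of top-scoring objects wanted"""
    top, seen, depth = [], set(), 0            # top: min-heap of (overall_grade, object_id)
    while all(depth < len(lst) for lst in lists):
        last = []
        for lst in lists:                      # one round of sorted accesses, one per list
            obj, grade = lst[depth]
            last.append(grade)
            if obj not in seen:                # random accesses fetch the remaining grades
                seen.add(obj)
                overall = agg(grades[obj])
                if len(top) < k:
                    heapq.heappush(top, (overall, obj))
                elif overall > top[0][0]:
                    heapq.heapreplace(top, (overall, obj))
        depth += 1
        threshold = agg(last)                  # best grade any unseen object could still reach
        if len(top) == k and top[0][0] >= threshold:
            break                              # no unseen object can beat the current k-th grade
    return sorted(top, reverse=True)
```

With agg=sum, for instance, this returns the k objects with the highest summed grades while typically reading only a prefix of each sorted list.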
recent research on web-based information systems has focused on employing middleware algorithms, where users had to specify weightings for each aspect of their query and a central compensation function was used to find the best matching objects [***]<2>, [1]<2>. the lack of expressiveness of this top k query model, however, has first been addressed by [8]<2> and with the growing incorporation of user preferences into database systems [6]<3>, [10]<3> and information services [22]<3> the limitations of the entire model became more and more obvious influence:2 type:2 pair index:533 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:465 citee title:optimizing multi-feature queries for image databases citee abstract:in digital libraries image retrieval queries can be based on the similarity of objects, using several feature attributes like shape, texture, color or text. surrounding text:g. [7]<1>, [***]<1>, [19]<1>. especially for content-based retrieval of multimedia data these techniques have proven to be particularly helpful. g. using the derivatives of the score distribution function in each list along the lines of [***]<1> may be employed to estimate the expected gain in each list. please note that our heuristic 1 will not affect the abstract order of complexity from our previously stated optimality results, because the maximum improvement factor over the round robin strategy can only be the number of lists (n). g. in [***]<1>. [figure: average improvement factor and fraction pruned in % for 3, 5, and 10 lists, for database sizes n = 10,000 and n = 100,000] influence:3 type:3 pair index:534 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems.
together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:257 citee title:foundations of preferences in database systems citee abstract:personalization of e-services poses new challenges to database technology, demanding a powerful and flexible modeling technique for complex preferences. preference queries have to be answered cooperatively by treating preferences as soft constraints, attempting a best possible match-making. we propose a strict partial order semantics for preferences, which closely matches people's intuition. a variety of natural and of sophisticated preferences are covered by this model. we show how to inductively construct complex preferences by means of various preference constructors. this model is the key to a new discipline called preference engineering and to a preference algebra. given the best-matches-only (bmo) query model we investigate how complex preference queries can be decomposed into simpler ones, preparing the ground for divide & conquer algorithms. standard sql and xpath can be extended seamlessly by such preferences (presented in detail in the companion paper). we believe that this model is appropriate to extend database technology towards effective support of personalization. surrounding text:2. skyline objects and regions of domination whereas [15]<2> and the more recent extensive system in [12]<2> with an algebra for integrating the concept of pareto optimality with the top k retrieval model for preference engineering and query optimization in databases [***]<2>, are more powerful in that they do not restrict skyline queries to numerical domains, they both rely on the naïve algorithm of quadratic complexity doing pairwise comparisons of all database objects influence:1 type:1 pair index:535 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems.
for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:256 citee title:preference sql - design, implementation, experiences citee abstract:current search engines can hardly cope adequately with fuzzy predicates defined by complex preferences. the biggest problem of search engines implemented with standard sql is that sql does not directly understand the notion of preferences. preference sql extends sql by a preference model based on strict partial orders (presented in more detail in the companion paper), where preference queries behave like soft selection constraints. several built-in base preference types and the powerful pareto operator, combined with the adherence to declarative sql programming style, guarantees great programming productivity. the preference sql optimizer does an efficient re-writing into standard sql, including a high-level implementation of the skyline operator for pareto-optimal sets. this pre-processor approach enables a seamless application surrounding text:2. skyline objects and regions of domination whereas [15]<2> and the more recent extensive system in [***]<2> with an algebra for integrating the concept of pareto optimality with the top k retrieval model for preference engineering and query optimization in databases [11]<2>, are more powerful in that they do not restrict skyline queries to numerical domains, they both rely on the naïve algorithm of quadratic complexity doing pairwise comparisons of all database objects influence:3 type:2 pair index:536 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems.
for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:466 citee title:shooting stars in the sky: an online algorithm for skyline queries citee abstract:skyline queries ask for a set of interesting points from a potentially large set of data points. if we are traveling, for instance, a restaurant might be interesting if there is no other restaurant which is nearer, cheaper, and has better food. skyline queries retrieve all such interesting restaurants so that the user can choose the most promising one. in this paper, we present a new online algorithm that computes the skyline. unlike most existing algorithms that compute the skyline in a batch, surrounding text:initially skyline queries were mainly intended to be performed within a single database query engine. thus the first algorithms and subsequent improvements all work on a central (multidimensional) index structure like r*-trees [20]<2>, certain partitioning schemes [21]<2> or k-nearest-neighbor searches [***]<2>. however, such central indexes cannot be applied to distributed web information systems influence:2 type:2 pair index:537 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:467 citee title:preferences: putting more knowledge into queries citee abstract:classical query languages provide a way to express mandatory qualifications on the data to be retrieved. they do not feature facilities for expressing preferences or desirable qualifications. the need for preferences is illustrated in a software engineering framework. a preference mechanism is then presented as an extension of a language of the domain relational calculus family, and the expressive power of the resulting language is discussed. the proposed mechanisms are shown to effectively allow the use of queries for supporting software configuration management functions surrounding text:the area of operations research and research in the field of human preferences like [6]<2> or [8]<2> has long criticized this lack of expressiveness. a more expressive model of non-discriminating combination has been introduced into the database community by [***]<2>. the skyline or pareto set is a set of nondominated answers in the result for a query under the notion of pareto optimality. 2.
skyline objects and regions of domination whereas [***]<2> and the more recent extensive system in [12]<2> with an algebra for integrating the concept of pareto optimality with the top k retrieval model for preference engineering and query optimization in databases [11]<2>, are more powerful in that they do not restrict skyline queries to numerical domains, they both rely on the naïve algorithm of quadratic complexity doing pairwise comparisons of all database objects influence:3 type:2 pair index:538 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:468 citee title:efficient utility functions for ceteris paribus preferences citee abstract:ceteris paribus (other things being equal) preference provides a convenient means for stating constraints on numeric utility functions, but direct constructions of numerical utility representations from such statements have exponential worst-case cost. this paper describes more efficient constructions that combine analysis of utility independence with constraint-based search. surrounding text:besides we will focus more closely on quality aspects of skyline queries. in this context especially a-posteriori quality assessments along the lines of our sampling technique and qualitative assessments like in [***]<3> may help users to cope with large result sets. we will also investigate our proposed quality measures in more detail and evaluate their individual usefulness influence:3 type:3 pair index:539 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services.
we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:469 citee title:supporting ranked boolean similarity queries in mars citee abstract:to address the emerging needs of applications that require access to and retrieval of multimedia objects, we are developing the multimedia analysis and retrieval system (mars). in this paper, we concentrate on the retrieval subsystem of mars and its support for content-based queries over image databases. content-based retrieval techniques have been extensively studied for textual documents in the area of automatic information retrieval. this paper describes how these techniques surrounding text:g. [7]<1>, [9]<1>, [***]<1>. especially for content-based retrieval of multimedia data these techniques have proven to be particularly helpful influence:3 type:3 pair index:540 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems.
for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:231 citee title:an optimal and progressive algorithm for skyline queries citee abstract:the skyline of a set of d-dimensional points contains the points that are not dominated by any other point on all dimensions. skyline computation has recently received considerable attention in the database community, especially for progressive (or online) algorithms that can quickly return the first skyline points without having to read the entire data file. currently, the most efficient algorithm is nn (nearest neighbors), which applies the divide-and-conquer framework on datasets indexed by r-trees. although nn has some desirable features (such as high speed for returning the initial skyline points, applicability to arbitrary data distributions and dimensions), it also presents several inherent disadvantages (need for duplicate elimination if d>2, multiple accesses of the same node, large space overhead). in this paper we develop bbs (branch-and-bound skyline), a progressive algorithm also based on nearest neighbor search, which is i/o optimal, i.e., it performs a single access only to those r-tree nodes that may contain skyline points. furthermore, it does not retrieve duplicates and its space overhead is significantly smaller than that of nn. finally, bbs is simple to implement and can be efficiently applied to a variety of alternative skyline queries. an analytical and experimental comparison shows that bbs outperforms nn (usually by orders of magnitude) under all problem instances surrounding text:initially skyline queries were mainly intended to be performed within a single database query engine. thus the first algorithms and subsequent improvements all work on a central (multidimensional) index structure like r*-trees [***]<2>, certain partitioning schemes [21]<2> or k-nearest-neighbor searches [13]<2>. however, such central indexes cannot be applied to distributed web information systems influence:2 type:2 pair index:541 citer id:462 citer title:efficient distributed skylining for web information systems citer abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems.
for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement citee id:159 citee title:a roadmap to advanced personalization of mobile services citee abstract:performing complex tasks over the web has become an integral part of our everyday life. surrounding text:recent research on web-based information systems has focused on employing middleware algorithms, where users had to specify weightings for each aspect of their query and a central compensation function was used to find the best matching objects [7]<2>, [1]<2>. the lack of expressiveness of this top k query model, however, has first been addressed by [8]<2> and with the growing incorporation of user preferences into database systems [6]<3>, [10]<3> and information services [***]<3> the limitations of the entire model became more and more obvious. this led towards the integration of so-called skyline queries (e. we also presented a number of advanced heuristics to further improve performance towards real-time applications. especially in the area of mobile information services [***]<3> using information from various content providers that is assembled on the fly for subsequent use, our algorithm will allow for more expressive queries by enabling users to specify even complex preferences in an intuitive way. confirming our optimality results, our performance evaluation shows that our algorithm scales with growing database sizes and already performs well for reasonable numbers of lists to combine influence:3 type:3 pair index:542 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:471 citee title:vector-space ranking with effective early termination citee abstract:considerable research effort has been invested in improving the effectiveness of information retrieval systems. techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. but such enhancements can add to the cost of evaluating queries. in this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. we describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. that is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations. surrounding text:2. algorithm over the past six years, impact-sorted indexes have been shown to be an effective and efficient data structure for processing text queries [***, 2]<2>.
these indexes store term weights directly in the index, like the smart system [6]<2>, however, impact-sorted indexes use a very small number of distinct term weights. in this paper we use just 8 different values. the small number of values used allows these indexes to store documents in impact order while still allowing for very high level of compression [***]<2>. to generate effective retrieval results, care must be taken in selecting the impact values assigned to each term. from an information retrieval perspective, this work can be seen as a combination of the max score work of turtle and flood [17]<2> combined with the frequency and impact sorted work of persin et al. and anh and moffat [***, 15]<2>. both brown and strohman et al. one involving assigning ranges of bm25 scores to integer values, and another using a document-centric approach. [***, 2]<2> bast et al. extend the threshold algorithm ideas of fagin et al influence:1 type:2 pair index:543 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:472 citee title:simplified similarity scoring using term ranks citee abstract:we propose a method for document ranking that combines a simple document-centric view of text, and fast evaluation strategies that have been developed in connection with the vector space model. the new method defines the importance of a term within a document qualitatively rather than quantitatively, and in doing so reduces the need for tuning parameters. in addition, the method supports very fast query processing, with most of the computation carried out on small integers, and dynamic pruning an effective option. experiments on a wide range of trec data show that the new method provides retrieval effectiveness as good as or better than the okapi bm25 formulation, and variants of language models. surrounding text:2. algorithm over the past six years, impact-sorted indexes have been shown to be an effective and efficient data structure for processing text queries [1, ***]<2>. these indexes store term weights directly in the index, like the smart system [6]<2>, however, impact-sorted indexes use a very small number of distinct term weights. one involving assigning ranges of bm25 scores to integer values, and another using a document-centric approach. [1, ***]<2> bast et al. extend the threshold algorithm ideas of fagin et al influence:2 type:2 pair index:544 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. 
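To make the impact-sorted organization described above concrete, here is a small sketch of our own (the uniform bucketing and the sample postings are illustrative; as the text notes, real systems choose the impact assignment more carefully): real-valued term scores are quantized into a handful of integer impact levels, such as the 8 values mentioned, and each posting list is then grouped by descending impact.

```python
from collections import defaultdict

NUM_LEVELS = 8  # the small, fixed number of distinct impact values

def quantize(score, lo, hi, levels=NUM_LEVELS):
    """Map a real-valued score in [lo, hi] to an integer impact in 1..levels
    (uniform bucketing, used here only as a stand-in for a tuned mapping)."""
    if hi <= lo:
        return levels
    bucket = int((score - lo) / (hi - lo) * levels)
    return min(levels, max(1, bucket + 1))

def impact_sorted_postings(scored_postings):
    """scored_postings: list of (doc_id, score). Returns (impact, [doc_ids]) blocks,
    highest impact first, with doc_ids ascending so each block compresses well."""
    lo = min(s for _, s in scored_postings)
    hi = max(s for _, s in scored_postings)
    blocks = defaultdict(list)
    for doc, score in scored_postings:
        blocks[quantize(score, lo, hi)].append(doc)
    return [(imp, sorted(docs)) for imp, docs in sorted(blocks.items(), reverse=True)]

postings = [(3, 2.1), (17, 0.4), (5, 1.9), (42, 0.45), (8, 1.0)]
print(impact_sorted_postings(postings))
```

Storing only a few distinct impact values per list is what allows the documents to be kept in impact order while the doc-id gaps within each block remain highly compressible.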
we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:473 citee title:pruned query evaluation using pre-computed impacts citee abstract:exhaustive evaluation of ranked queries can be expensive, particularly when only a small subset of the overall ranking is required, or when queries contain common terms. this concern gives rise to techniques for dynamic query pruning, that is, methods for eliminating redundant parts of the usual exhaustive evaluation, yet still generating a demonstrably "good enough" set of answers to the query. in this work we propose new pruning methods that make use of impact-sorted indexes. compared to exhaustive evaluation, the new methods reduce the amount of computation performed, reduce the amount of memory required for accumulators, reduce the amount of data transferred from disk, and at the same time allow performance guarantees in terms of precision and mean average precision. these strong claims are backed by experiments using the trec terabyte collection and queries. surrounding text:we present a new method for continuous accumulator pruning in impact-sorted indexes. our method increases query throughput by 15% over the method proposed by anh and moffat [***]<2> while still returning rank-safe results (i.e., documents are returned in the same order that they would be in an unoptimized evaluation). we show how our accumulator pruning technique can be combined with inverted list skipping to achieve a 69% total increase in throughput while maintaining the rank-safe property. we show that storing inverted lists in memory can significantly improve performance, adding to previous results from buttcher and clarke [7]<1>. we show that the algorithm presented by anh and moffat [***]<2> can evaluate queries 7 times faster on our system than the speed quoted in their paper. the relatively long indexing time required by our system should not be a reflection of the optimized indexing time for this task. previous work in this area indicates that impact-sorted gov2 indexes can be built in under 7 hours on a typical desktop computer using optimized implementations [***]<2>. influence:1 type:2 pair index:545 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:35 citee title:a document-centric approach to static index pruning in text retrieval systems citee abstract:we present a static index pruning method, to be used in ad-hoc document retrieval tasks, that follows a document-centric approach to decide whether a posting for a given term should remain in the index or not.
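The accumulator pruning discussed in the surrounding text above rests on a simple observation: once the unprocessed impacts can no longer lift a new document into the top k, new accumulators stop being useful. The sketch below is our own simplification of that general idea, not the published method of the citer or of Anh and Moffat.

```python
def pruned_score_at_a_time(query_terms, impact_index, k):
    """impact_index: term -> list of (impact, [doc_ids]) blocks, highest impact first.
    Process blocks across all terms in decreasing impact order; once a brand-new
    document could not reach the current top k even with every remaining impact,
    stop creating new accumulators. (A hedged sketch, not a rank-safe guarantee.)"""
    blocks = []
    for t in query_terms:
        for i, (impact, docs) in enumerate(impact_index.get(t, [])):
            blocks.append((impact, t, i, docs))
    blocks.sort(reverse=True)                       # highest impacts first
    acc, allow_new = {}, True
    remaining = sum(b[0] for b in blocks)           # upper bound on unprocessed impact
    for impact, _term, _i, docs in blocks:
        kth = sorted(acc.values(), reverse=True)[k - 1] if len(acc) >= k else 0
        if remaining <= kth:
            allow_new = False                       # newcomers can no longer enter the top k
        for d in docs:
            if d in acc:
                acc[d] += impact
            elif allow_new:
                acc[d] = impact
        remaining -= impact
    return sorted(acc.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

The appeal of impact ordering is visible here: because the most valuable postings arrive first, the pruning condition triggers early and most of the tail of each list never creates accumulator traffic.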
the decision is made based on the term's contribution to the document's kullback-leibler divergence from the text collection's global language model. our technique can be used to decrease the size of the index by over 90%, at only a minor decrease in retrieval effectiveness. it thus allows us to make the index small enough to fit entirely into the main memory of a single pc, even for large text collections containing millions of documents. this results in great efficiency gains, superior to those of earlier pruning methods, and an average response time around 20 ms on the gov2 document collection surrounding text:we show that storing inverted lists in memory can significantly improve performance, adding to previous results from buttcher and clarke [***]<1>. we show that the algorithm presented by anh and moffat [3]<2> can evaluate queries 7 times faster on our system than the speed quoted in their paper influence:1 type:2 pair index:546 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:474 citee title:the trec 2006 terabyte track citee abstract:trec 2006 is the third year for the track. the track was introduced as part of trec 2004, with a single adhoc retrieval task. for trec 2005, the track was expanded with two optional tasks: a named page finding task and an efficiency task. these three tasks were continued in 2006, with 20 groups submitting runs to the adhoc retrieval task, 11 groups submitting runs to the named page finding task, and 8 groups submitting runs to the efficiency task. this report provides an overview of each task, summarizes the results, and outlines directions for the future. further background information on the development of the track can be found in the 2004 and 2005 track reports surrounding text:query performance improves in part because of memory speed and in part because of the smaller amount of data, although these different factors were not analyzed in detail. more recent results from the trec 2006 terabyte track show that other researchers have considered static pruning [***]<2>. while the actual process of storing precomputed scores in lists is not the subject of this paper, there are many examples in the literature of researchers doing this influence:3 type:3 pair index:547 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness.
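The document-centric pruning criterion described at the start of this record can be sketched roughly as follows. This is our own simplification under stated assumptions (unsmoothed maximum-likelihood document model, a fixed number of kept terms per document); the cited method determines the cutoff differently.

```python
import math
from collections import Counter

def kl_contributions(doc_terms, collection_prob):
    """For each term t in the document, compute p_d(t) * log(p_d(t) / p_C(t)),
    its contribution to KL(document model || collection model)."""
    counts = Counter(doc_terms)
    total = sum(counts.values())
    contrib = {}
    for term, c in counts.items():
        p_d = c / total
        p_c = collection_prob.get(term, 1e-9)   # tiny floor for terms unseen in the collection
        contrib[term] = p_d * math.log(p_d / p_c)
    return contrib

def prune_document(doc_terms, collection_prob, keep=10):
    """Keep postings only for the `keep` terms with the largest KL contributions;
    postings for all other terms of this document are dropped from the index."""
    contrib = kl_contributions(doc_terms, collection_prob)
    return sorted(contrib, key=contrib.get, reverse=True)[:keep]
```

Terms that are unusually frequent in the document relative to the collection contribute most to the divergence, which is why dropping the rest costs little retrieval effectiveness while shrinking the index dramatically.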
we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:40 citee title:static index pruning for information retrieval systems citee abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index surrounding text:carmel et al. considered this process [***]<2>. more recently, buttcher and clarke considered index pruning specifically so that the resulting index would fit in memory, although supplemental disk indexes are sometimes used for additional information influence:2 type:2 pair index:548 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:475 citee title:mapreduce: simplified data processing on large clusters citee abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. 
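The map/reduce contract summarized in the abstract above, and the (document, word, count) first indexing stage mentioned in the next record, map onto a few lines of code. The sketch below is ours and runs in a single process; the shuffle/sort step is simulated with a dictionary rather than a cluster.

```python
from collections import Counter, defaultdict

def map_phase(doc_id, text):
    """Map: one document -> intermediate (word, (doc_id, count)) pairs."""
    for word, count in Counter(text.lower().split()).items():
        yield word, (doc_id, count)

def reduce_phase(word, values):
    """Reduce: merge all (doc_id, count) values for a word into a posting list."""
    return word, sorted(values)

def run(documents):
    intermediate = defaultdict(list)          # stands in for the shuffle/sort step
    for doc_id, text in documents.items():
        for word, value in map_phase(doc_id, text):
            intermediate[word].append(value)
    return dict(reduce_phase(w, vs) for w, vs in intermediate.items())

docs = {1: "skyline queries over web sources", 2: "skyline queries in databases"}
print(run(docs))   # e.g. 'skyline' -> [(1, 1), (2, 1)]
```

The appeal of the model for indexing is that only these two small functions are problem-specific; partitioning, scheduling, and fault handling belong to the runtime.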
programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day surrounding text:our indexer was built for simplicity, experimentation and extreme parallelism. we used an approach that combines ideas from early information retrieval systems with principles from dean and ghemawat's mapreduce [***]<1> in order to make a simple, parallel indexer. the first stage processes text documents, converting them into compressed lists of (document, word, count) tuples influence:3 type:3 pair index:549 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes. for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm (fagin's algorithm, or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm (the threshold algorithm, or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. © 2003 elsevier science (usa). all rights reserved. surrounding text:roughly at the same time as the first impact-sorted work, fagin et al.
proposed a class of algorithms known as threshold algorithms [***]<2>. these algorithms, like the ones shown in this paper, provide a method for efficiently computing aggregate functions over multiple sorted lists by maintaining statistics about the data that remains to be read influence:2 type:2 pair index:550 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:476 citee title:space-limited ranked query evaluation using adaptive pruning citee abstract:evaluation of ranked queries on large text collections can be costly in terms of processing time and memory space. dynamic pruning techniques allow both costs to be reduced, at the potential risk of decreased retrieval effectiveness. in this paper we describe an improved query pruning mechanism that offers a more resilient tradeoff between query evaluation costs and retrieval effectiveness than do previous pruning approaches surrounding text:lester et al. [***]<2> show how the memory footprint of accumulators can be significantly reduced without loss of effectiveness. their algorithm scans inverted lists in document order, but processes only postings with term counts larger than some threshold influence:1 type:2 pair index:551 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:43 citee title:self-indexing inverted files for fast text retrieval citee abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost.
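The internal index inside each inverted list described in the self-indexing abstract above is essentially a table of skip pointers. A minimal sketch of the idea follows; the sqrt spacing is only the common rule of thumb used for illustration, and, as noted in the next record, the citing paper uses a slightly different formulation.

```python
import math

def build_skips(postings):
    """postings: strictly ascending doc ids. Place a skip entry (position, doc_id)
    roughly every sqrt(len) postings -- a common rule of thumb for skip spacing."""
    step = max(1, int(math.sqrt(len(postings))))
    return [(i, postings[i]) for i in range(0, len(postings), step)]

def advance_to(postings, skips, target):
    """Return (position, doc_id) of the first posting >= target,
    jumping over whole blocks whose skip entry is still below the target."""
    start = 0
    for pos, doc in skips:
        if doc <= target:
            start = pos          # everything before pos is smaller than the target
        else:
            break
    for i in range(start, len(postings)):
        if postings[i] >= target:
            return i, postings[i]
    return None

plist = [2, 5, 9, 14, 20, 31, 44, 58, 60, 71, 88, 90, 95, 99, 120, 131]
skips = build_skips(plist)
print(advance_to(plist, skips, 60))   # starts scanning near position 8 instead of 0
```

The saving is exactly the one the abstract reports: conjunctive and pruned ranked queries rarely need every posting, so most of each list can be stepped over.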
similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness surrounding text:choosing skip lengths when adding skip information to an index, it makes sense to ask how long the skip distance should be. moffat and zobel [***]<1> suggest a method for determining this parameter. we use a slightly different formulation in this paper which remains in the same spirit influence:2 type:2 pair index:552 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:44 citee title:filtered document retrieval with frequency-sorted indexes citee abstract:ranking techniques are effective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval effectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed surrounding text:these indexes are a natural next step following the frequency-sorted indexes of persin et al. [***]<1>. the frequency-sorted indexes suggested by persin et al influence:1 type:1 pair index:553 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:477 citee title:optimization strategies for complex queries citee abstract:previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. however, the literature for optimizing document-at-a-time systems remains sparse.
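The document-at-a-time optimizations discussed next (the max score family) hinge on a per-term upper bound for the score any single posting can contribute. The sketch below is a bare-bones illustration of that pruning principle under our own simplified setup; it is neither the cited max score algorithm nor its term-bounded variant.

```python
import heapq

def daat_with_upper_bounds(query_terms, index, upper_bound, k):
    """index       : term -> dict doc_id -> partial score
    upper_bound : function term -> maximum partial score the term can contribute
    A candidate is abandoned once its score plus the remaining terms' upper
    bounds cannot beat the current k-th best score."""
    terms = sorted(query_terms, key=upper_bound, reverse=True)
    candidates = set().union(*(index.get(t, {}).keys() for t in terms))
    suffix_bound = [0.0] * (len(terms) + 1)
    for i in range(len(terms) - 1, -1, -1):
        suffix_bound[i] = suffix_bound[i + 1] + upper_bound(terms[i])
    top = []                                   # min-heap of (score, doc)
    for doc in candidates:
        score = 0.0
        for i, t in enumerate(terms):
            if len(top) == k and score + suffix_bound[i] <= top[0][0]:
                break                          # cannot enter the top k: stop scoring this doc
            score += index.get(t, {}).get(doc, 0.0)
        else:
            if len(top) < k:
                heapq.heappush(top, (score, doc))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, doc))
    return sorted(top, reverse=True)

index = {"skyline": {1: 1.2, 2: 0.7}, "query": {2: 0.9, 3: 0.4}}
def ub(term):
    return max(index.get(term, {0: 0.0}).values())
print(daat_with_upper_bounds(["skyline", "query"], index, ub, k=2))
```

Documents that clearly cannot reach the top k are abandoned after touching only their highest-bound terms, which is where the reported speedups over an unoptimized document-at-a-time baseline come from.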
we present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. like max score, our technique, called term bounded max score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. we simulated our technique to explore the problem space, then implemented it in indri, our large scale language modeling search engine. tests with the gov2 corpus on title queries show our method to be 23% faster than max score alone, and 61% faster than our document-at-a-time baseline. our optimized query times are competitive with conventional term-at-a-time systems on this year's trec terabyte task surrounding text:considered supplemental lists of top scoring documents during query evaluation which can also be considered part of this tradition. [5, ***]<2>. it is possible to have some of the benefits of impact-sorted lists while still sorting by document influence:2 type:2 pair index:554 citer id:470 citer title:efficient document retrieval in main memory citer abstract:disk access performance is a major bottleneck in traditional information retrieval systems. compared to system memory, disk bandwidth is poor, and seek times are worse. we circumvent this problem by considering query evaluation strategies in main memory. we show how new accumulator trimming techniques combined with inverted list skipping can produce extremely high performance retrieval systems without resorting to methods that may harm effectiveness. we evaluate our techniques using galago, a new retrieval system designed for efficient query processing. our system achieves a 69% improvement in query throughput over previous methods citee id:478 citee title:monetdb/x100 - a dbms in the cpu cache citee abstract:x100 is a new execution engine for the monetdb system, that improves execution speed and overcomes its main memory limitation. it introduces the concept of in-cache vectorized processing that strikes a balance between the existing column-at-a-time mil execution primitives of monetdb and the tuple-at-a-time volcano pipelining model, avoiding their drawbacks: intermediate result materialization and large interpretation overhead, respectively. we explain how the new query engine makes better use of cache memories as well as parallel computation resources of modern super-scalar cpus. monetdb/x100 can be one to two orders of magnitude faster than commercial dbmss and close to hand-coded c programs for computationally intensive queries on in-memory datasets. to address larger disk-based datasets with the same efficiency, a new columnbm storage layer is developed that boosts bandwidth using ultra lightweight compression and cooperative scans. surrounding text:while memory-optimized processing is a relatively new field for information retrieval research, it is well studied in the database community. the monetdb system is the traditional example of a main-memory database system [***]<2>. monetdb stores relational data by column instead of by row, which significantly increases the speed of certain classes of data warehousing database transactions influence:2 type:2 pair index:555 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets.
users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:611 citee title:high-performance sorting on networks of workstations citee abstract:we report the performance of now-sort, a collection of sorting implementations on a network of workstations (now). we find that parallel sorting on a now is competitive to sorting on the large-scale smps that have traditionally held the performance records. on a 64-node cluster, we sort 6.0 gb in just under one minute, while a 32-node cluster finishes the datamation benchmark in 2.41 seconds. our implementations can be applied to a variety of disk, memory, and processor configurations; we highlight salient issues for tuning each component of the system. we evaluate the use of commodity operating systems and hardware for parallel sorting. we find existing os primitives for memory management and file access adequate. due to aggregate communication and disk bandwidth requirements, the bottleneck of our system is the workstation i/o bus. surrounding text:though not the focus of this paper, the cluster management system is similar in spirit to other systems such as condor [16]<1>. the sorting facility that is a part of the mapreduce library is similar in operation to now-sort [***]<2>. source machines (map workers) partition the data to be sorted and send it to one of r reduce workers influence:3 type:2 pair index:556 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. 
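the abstract above characterizes the programming model by its two user-supplied functions. a toy, single-process stand-in (no cluster, no partitioning, no fault tolerance) makes the shape of the model concrete; word counting is the usual illustration. the function names here are placeholders, not the actual library interface.

from collections import defaultdict

# user-supplied functions in the style the abstract describes:
# map takes one input key/value pair and emits intermediate key/value pairs,
# reduce merges all values that share an intermediate key.
def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # toy, single-process stand-in: grouping by intermediate key in memory,
    # no distribution across machines, no re-execution on failure
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            groups[ikey].append(ivalue)          # shuffle/group by key
    output = []
    for ikey in sorted(groups):
        output.extend(reduce_fn(ikey, groups[ikey]))
    return output

print(run_mapreduce([("d1", "to be or not to be")], map_fn, reduce_fn))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
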
our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:298 citee title:cluster i/o with river: making the fast case common citee abstract:we introduce river, a data-flow programming environment and i/o substrate for clusters of computers. river is designed to provide maximum performance in the common case -- even in the face of non-uniformities in hardware, software, and workload. river is based on two simple design features: a high-performance distributed queue, and a storage redundancy mechanism called graduated declustering. we have implemented a number of data-intensive applications on river, which validate our design with surrounding text:of course now-sort does not have the user-definable map and reduce functions that make our library widely applicable. river [***]<2> provides a programming model where processes communicate with each other by sending data over distributed queues. like mapreduce, the river system tries to provide good average case performance even in the presence of non-uniformities introduced by heterogeneous hardware or system perturbations influence:3 type:2 pair index:557 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:294 citee title:charlotte: metacomputing on the web citee abstract:the world wide web has the potential of being used as an inexpensive and convenient metacomputing resource. this brings forward new challenges and invalidates many of the assumptions made in offering the same functionality for a network of workstations. 
we have designed and implemented charlotte which goes beyond providing a set of features commonly used for a network of workstations: (1) a user can execute a parallel program on a machine she does not have an account on; (2) neither a shared surrounding text:we run on commodity processors to which a small number of disks are directly connected instead of running directly on disk controller processors, but the general approach is similar. our backup task mechanism is similar to the eager scheduling mechanism employed in the charlotte system [***]<2>. one of the shortcomings of simple eager scheduling is that if a given task causes repeated failures, the entire computation fails to complete influence:3 type:2 pair index:558 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:705 citee title:web search for a planet: the google cluster architecture citee abstract:application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. to handle this workload, google's architecture features clusters of more than 15,000 commodity-class pcs with fault-tolerant software. this architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers surrounding text:for example, one implementation may be suitable for a small shared-memory machine, another for a large numa multi-processor, and yet another for an even larger collection of networked machines. this section describes an implementation targeted to the computing environment in wide use at google: large clusters of commodity pcs connected together with switched ethernet [***]<3>. in our environment: (1) machines are typically dual-processor x86 processors running linux, with 2-4 gb of memory per machine influence:3 type:3 pair index:559 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets.
users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:537 citee title:explicit control in a batch-aware distributed file system citee abstract:we present the design, implementation, and evaluation of the batch-aware distributed file system (bad-fs), a system designed to orchestrate large, i/o-intensive batch workloads on remote computing clusters distributed across the wide area. bad-fs consists of two novel components: a storage layer that exposes control of traditionally fixed policies such as caching, consistency, and replication; and a scheduler that exploits this control as necessary for different workloads. by extracting control from the storage layer and placing it within an external scheduler, bad-fs manages both storage and computation in a coordinated way while gracefully dealing with cache consistency, fault-tolerance, and space management issues in a workload-specific manner. using both microbenchmarks and real workloads, we demonstrate the performance benefits of explicit control, delivering excellent end-to-end performance across the wide-area. surrounding text:the restricted programming model also allows us to schedule redundant executions of tasks near the end of the job which greatly reduces completion time in the presence of non-uniformities (such as slow or stuck workers). bad-fs [***]<2> has a very different programming model from mapreduce, and unlike mapreduce, is targeted to the execution of jobs across a wide-area network. however, there are two fundamental similarities influence:2 type:2 pair index:560 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication.
this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:706 citee title:scans as primitive parallel operations citee abstract:a study of the effects of adding two scan primitives as unit-time primitives to pram (parallel random access machine) models is presented. it is shown that the primitives improve the asymptotic running time of many algorithms by an o(log n) factor, greatly simplifying the description of many algorithms, and are significantly easier to implement than memory references. it is argued that the algorithm designer should feel free to use these operations as if they were as cheap as a memory reference. the author describes five algorithms that clearly illustrate how the scan primitives can be used in algorithm design: a radix-sort algorithm, a quicksort algorithm, a minimum-spanning-tree algorithm, a line-drawing algorithm, and a merging algorithm. these all run on an erew (exclusive read, exclusive write) pram with the addition of two scan primitives and are either simpler or more efficient than their pure pram counterparts. the scan primitives have been implemented in microcode on the connection machine system, are available in paris (the parallel instruction set of the machine) surrounding text:7 related work many systems have provided restricted programming models and used the restrictions to parallelize the computation automatically. for example, an associative function can be computed over all prefixes of an n element array in logn time on n processors using parallel prefix computations [***, 9, 13]<2>. mapreduce can be considered a simplification and distillation of some of these models based on our experience with large real-world computations influence:2 type:2 pair index:561 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. 
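the related-work passage quoted above notes that an associative function can be computed over all prefixes of an n-element array in log n time on n processors. one standard way to picture the schema is recursive doubling, sketched below as sequential python in which each pass of the inner loop stands for one parallel step; this is a generic scan, not the construction of any particular cited paper.

def prefix_scan(values, op):
    # inclusive scan by recursive doubling: after the round with stride
    # `offset`, position i holds the combination of up to 2*offset elements
    # ending at i. written sequentially; on a pram each inner loop is one
    # parallel step across n processors.
    result = list(values)
    n = len(result)
    offset = 1
    while offset < n:
        # iterate right-to-left so no update reads a cell already updated
        # in this round (on a pram these updates happen simultaneously)
        for i in range(n - 1, offset - 1, -1):
            result[i] = op(result[i - offset], result[i])
        offset *= 2
    return result

print(prefix_scan([3, 1, 4, 1, 5], lambda a, b: a + b))  # [3, 4, 8, 9, 14]
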
programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:299 citee title:cluster-based scalable network services citee abstract:we identify three fundamental requirements for scalable network services: incremental scalability and overflow growth provisioning, 24x7 availability through fault masking, and cost-effectiveness. we argue that clusters of commodity workstations interconnected by a high-speed san are exceptionally well-suited to meeting these challenges for internet-server workloads, provided the software infrastructure for managing partial failures and administering a large cluster does not have to be reinvented for each new service. to this end, we propose a general, layered architecture for building cluster-based scalable network services that encapsulates the above requirements for reuse, and a service-programming model based on composable workers that perform transformation, aggregation, caching, and customization (tacc) of internet content. for both performance and implementation simplicity, the architecture and tacc programming model exploit base, a weaker-than-acid data semantics that results from trading consistency for availability and relying on soft state for robustness in failure management. our architecture can be used as an off-the-shelf infrastructural platform for creating new network services, allowing authors to focus on the content of the service (by composing tacc building blocks) rather than its implementation. we discuss two real implementations of services based on this architecture: transend, a web distillation proxy deployed to the uc berkeley dialup ip population, and hotbot, the commercial implementation of the inktomi search engine. we present detailed measurements of transend's performance based on substantial client traces, as well as anecdotal evidence from the transend and hotbot experience, to support the claims made for the architecture. surrounding text:(2) both use locality-aware scheduling to reduce the amount of data sent across congested network links. tacc [***]<2> is a system designed to simplify construction of highly-available networked services. like mapreduce, it relies on re-execution as a mechanism for implementing fault-tolerance influence:2 type:2 pair index:562 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:707 citee title:systematic efficient parallelization of scan and other list homomorphisms citee abstract:homomorphisms are functions which can be parallelized by the divide-and-conquer paradigm. a class of distributable homomorphisms (dh) is introduced and an efficient parallel implementation schema for all functions of the class is derived by transformations in the bird-meertens formalism. the schema can be directly mapped on the hypercube with an unlimited or an arbitrary fixed number of processors, providing provable correctness and predictable performance. the popular scan-function (parallel surrounding text:7 related work many systems have provided restricted programming models and used the restrictions to parallelize the computation automatically. for example, an associative function can be computed over all prefixes of an n-element array in log n time on n processors using parallel prefix computations [6, ***, 13]<2>. mapreduce can be considered a simplification and distillation of some of these models based on our experience with large real-world computations influence:2 type:2 pair index:563 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines.
programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:415 citee title:diamond: a storage architecture for early discard in interactive search citee abstract:this paper explores the concept of early discard for interactive search of unindexed data. processing data inside storage devices using downloaded searchlet code enables diamond to perform efficient, application-specific filtering of large data collections. early discard helps users who are looking for "needles in a haystack" by eliminating the bulk of the irrelevant items as early as possible. a searchlet consists of a set of application-generated filters that diamond uses to determine whether an object may be of interest to the user. the system optimizes the evaluation order of the filters based on run-time measurements of each filter's selectivity and computational cost. diamond can also dynamically partition computation between the storage devices and the host computer to adjust for changes in hardware and network conditions. performance numbers show that diamond dynamically adapts to a query and to run-time system state. an informal user study of an image retrieval application supports our belief that early discard significantly improves the quality of interactive searches surrounding text:a key difference between these systems and mapreduce is that mapreduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance. our locality optimization draws its inspiration from techniques such as active disks [***, 15]<1>, where computation is pushed into processing elements that are close to local disks, to reduce the amount of data sent across i/o subsystems or the network. we run on commodity processors to which a small number of disks are directly connected instead of running directly on disk controller processors, but the general approach is similar influence:2 type:1 pair index:564 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines.
programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:708 citee title:parallel prefix computation citee abstract:the prefix problem is to compute all the products x1 o x2 o ... o xk for 1 ≤ k ≤ n, where o is an associative operation. a recursive construction is used to obtain a product circuit for solving the prefix problem which has depth exactly ⌈log n⌉ and size bounded by 4n. an application yields fast, small boolean circuits to simulate finite-state transducers. by simulating a sequential adder, a boolean circuit which has depth 2⌈log n⌉ + 2 and size bounded by 14n is obtained for n-bit binary addition. the size can be decreased significantly by permitting the depth to increase by an additive constant surrounding text:7 related work many systems have provided restricted programming models and used the restrictions to parallelize the computation automatically. for example, an associative function can be computed over all prefixes of an n-element array in log n time on n processors using parallel prefix computations [6, 9, ***]<2>. mapreduce can be considered a simplification and distillation of some of these models based on our experience with large real-world computations influence:2 type:2 pair index:565 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:461 citee title:efficient dispersal of information for security, load balancing and fault tolerance citee abstract:an information dispersal algorithm (ida) is developed that breaks a file f of length l = |f| into n pieces fi, 1 ≤ i ≤ n, each of length |fi| = l/m, so that every m pieces suffice for reconstructing f. dispersal and reconstruction are computationally efficient. the sum of the lengths |fi| is (n/m) · l. since n/m can be chosen to be close to 1, the ida is space efficient. ida has numerous applications to secure and reliable storage of information in computer networks and even on single disks, to fault-tolerant and efficient transmission of information in networks, and to communications between processors in parallel computers.
for the latter problem provably time-efficient and highly fault-tolerant routing on the n-cube is achieved, using just constant size buffers. surrounding text:we write two replicas because that is the mechanism for reliability and availability provided by our underlying file system. network bandwidth requirements for writing data would be reduced if the underlying file system used erasure coding [***]<3> rather than replication. influence:3 type:3 pair index:566 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:198 citee title:active disks for large-scale data processing citee abstract:ory cost decreases, system intelligence continues to move away from the cpu and into peripherals. storage system designers use this trend toward excess computing power to perform more complex processing and optimizations inside storage devices. to date, such optimizations take place at relatively low levels of the storage protocol. trends in storage density, mechanics, and electronics eliminate the hardware bottleneck and put pressure on interconnects and hosts to move data more efficiently. we propose using an active disk storage device that combines on-drive processing and memory with software downloadability to allow disks to execute application-level functions directly at the device. moving portions of an application's processing to a storage device significantly reduces data traffic and leverages the parallelism already present in large systems, dramatically reducing the execution time for many basic data mining tasks. surrounding text:a key difference between these systems and mapreduce is that mapreduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance. our locality optimization draws its inspiration from techniques such as active disks [12, ***]<1>, where computation is pushed into processing elements that are close to local disks, to reduce the amount of data sent across i/o subsystems or the network.
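the surrounding text above contrasts replication with erasure coding for saving write bandwidth. the toy sketch below uses a single xor parity piece (any m of the m+1 pieces reconstruct the data), which is far simpler than the m-of-n information dispersal algorithm of the cited paper, but it shows why coded pieces cost a factor of (m+1)/m in storage and transfer rather than the 2x of a full replica. all names here are illustrative assumptions.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, m):
    # split `data` into m equal pieces plus one xor parity piece;
    # total stored/transferred: (m + 1)/m of the original size
    assert len(data) % m == 0
    size = len(data) // m
    pieces = [data[i * size:(i + 1) * size] for i in range(m)]
    parity = pieces[0]
    for p in pieces[1:]:
        parity = xor_bytes(parity, p)
    return pieces + [parity]

def decode(pieces, lost_index):
    # rebuild the one missing piece from the surviving pieces
    survivors = [p for i, p in enumerate(pieces) if i != lost_index and p is not None]
    rebuilt = survivors[0]
    for p in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, p)
    return rebuilt

blocks = encode(b"abcdefghijkl", m=3)      # 3 data pieces + 1 parity piece
missing = blocks[1]
blocks[1] = None
assert decode(blocks, lost_index=1) == missing
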
we run on commodity processors to which a small number of disks are directly connected instead of running directly on disk controller processors, but the general approach is similar influence:2 type:1 pair index:567 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:441 citee title:distributed computing in practice: the condor experience citee abstract:since 1984, the condor project has enabled ordinary users to do extraordinary computing. today, the project continues to explore the social and technical problems of cooperative computing on scales ranging from the desktop to the world-wide computational grid. in this chapter, we provide the history and philosophy of the condor project and describe how it has interacted with other projects and evolved along with the field of distributed computing. we outline the core components of the condor system and describe how the technology of computing must correspond to social structures. throughout, we reflect on the lessons of experience and chart the course traveled by research ideas as they grow into production systems. surrounding text:the mapreduce implementation relies on an in-house cluster management system that is responsible for distributing and running user tasks on a large collection of shared machines. though not the focus of this paper, the cluster management system is similar in spirit to other systems such as condor [***]<1>. the sorting facility that is a part of the mapreduce library is similar in operation to now-sort [1]<2> influence:3 type:2 pair index:568 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. 
the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines. programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:7 citee title:a bridging model for parallel computation citee abstract:the success of the von neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled on to this model; yet it can be efficiently implemented in hardware. the author argues that an analogous bridge between software and hardware is required for parallel computation if that is to become as widely used. this article introduces the bulk-synchronous parallel (bsp) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, as well as in being implemented in hardware. surrounding text:in contrast, most of the parallel processing systems have only been implemented on smaller scales and leave the details of handling machine failures to the programmer. bulk synchronous programming [***]<2> and some mpi primitives [11]<2> provide higher-level abstractions that make it easier for programmers to write parallel programs. a key difference between these systems and mapreduce is that mapreduce exploits a restricted programming model to parallelize the user program automatically and to provide transparent fault-tolerance influence:2 type:2 pair index:569 citer id:475 citer title:mapreduce: simplified data processing on large clusters citer abstract:mapreduce is a programming model and an associated implementation for processing and generating large data sets. users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. many real world tasks are expressible in this model, as shown in the paper. programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. the run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. this allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. our implementation of mapreduce runs on a large cluster of commodity machines and is highly scalable: a typical mapreduce computation processes many terabytes of data on thousands of machines.
programmers find the system easy to use: hundreds of mapreduce programs have been implemented and upwards of one thousand mapreduce jobs are executed on google's clusters every day citee id:709 citee title:spsort: how to sort a terabyte quickly citee abstract:in december 1998, a 488 node ibm rs/6000 sp* sorted a terabyte of data (10 billion 100 byte records) in 17 minutes, 37 seconds. this is more than 2.5 times faster than the previous record for a problem of this magnitude. the spsort program itself was custom-designed for this benchmark, but the cluster, its interconnection hardware, disk subsystem, operating system, file system, communication library, and job management software are all ibm products. the system sustained an aggregate data surrounding text:including startup overhead, the entire computation takes 891 seconds. this is similar to the current best reported result of 1057 seconds for the terasort benchmark [***]<3>. a few things to note: the input rate is higher than the shuffle rate and the output rate because of our locality optimization influence:3 type:3 pair index:570 citer id:477 citer title:optimization strategies for complex queries citer abstract:previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. however, the literature for optimizing document-at-a-time systems remains sparse. we present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. like max score, our technique, called term bounded max score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. we simulated our technique to explore the problem space, then implemented it in indri, our large scale language modeling search engine. tests with the gov2 corpus on title queries show our method to be 23% faster than max score alone, and 61% faster than our document-at-a-time baseline. our optimized query times are competitive with conventional term-at-a-time systems on this year's trec terabyte task citee id:489 citee title:efficient query evaluation using a two-level retrieval process citee abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest surrounding text:broder, et al. [***]<2>, consider query evaluation in a two stage process.
first, the query is run as a boolean and query, where a document is only scored if all terms appear influence:1 type:2 pair index:571 citer id:477 citer title:optimization strategies for complex queries citer abstract:previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. however, the literature for optimizing document-at-a-time systems remains sparse. we present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. like max score, our technique, called term bounded max score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. we simulated our technique to explore the problem space, then implemented it in indri, our large scale language modeling search engine. tests with the gov2 corpus on title queries show our method to be 23% faster than max score alone, and 61% faster than our document-at-a-time baseline. our optimized query times are competitive with conventional term-at-a-time systems on this year's trec terabyte task citee id:493 citee title:optimization of inverted vector searches citee abstract:a simple algorithm is presented for increasing the efficiency of information retrieval searches which are implemented using inverted files. this optimization algorithm employs knowledge about the methods used for weighting document and query terms in order to examine as few inverted lists as possible. an extension to the basic algorithm allows greatly increased performance optimization at a modest cost in retrieval effectiveness. experimental runs are made examining several different term weighting models and showing the optimization possible with each. surrounding text:this allows the system to maintain a list of the top k scores it has seen so far. one of the earliest papers on modern query optimization comes from buckley [***]<1>. in this approach, terms are evaluated one at a time, from the least frequent term to the most frequent term influence:1 type:2 pair index:572 citer id:477 citer title:optimization strategies for complex queries citer abstract:previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. however, the literature for optimizing document-at-a-time systems remains sparse. we present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. like max score, our technique, called term bounded max score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. we simulated our technique to explore the problem space, then implemented it in indri, our large scale language modeling search engine. tests with the gov2 corpus on title queries show our method to be 23% faster than max score alone, and 61% faster than our document-at-a-time baseline. our optimized query times are competitive with conventional term-at-a-time systems on this year's trec terabyte task citee id:803 citee title:the inquery retrieval system citee abstract:as larger and more heterogeneous text databases become available, information retrieval research will depend on the development of powerful, efficient and flexible retrieval engines.
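the surrounding text above describes buckley-style term-at-a-time optimization: inverted lists are processed one term at a time, least frequent term first, while accumulators track partial document scores and the current top k is maintained. a schematic version, under an assumed {term: {docid: weight}} layout and with my own function name, follows; it is a simplification, not the cited algorithm.

import heapq
from collections import defaultdict

def term_at_a_time(query_postings, k=10):
    # process inverted lists one term at a time, rarest term first,
    # accumulating partial scores per document
    accumulators = defaultdict(float)
    for term in sorted(query_postings, key=lambda t: len(query_postings[t])):
        for doc, weight in query_postings[term].items():
            accumulators[doc] += weight
    # keep only the k highest-scoring documents
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])
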
in this paper, we describe a retrieval system (inquery) that is based on a probabilistic retrieval model and provides support for sophisticated indexing and complex query formulation. inquery has been used successfully with databases containing nearly 400,000 documents. the increasing interest in... surrounding text:motivation our research is motivated by the indri retrieval system, a new language modeling search engine developed at the university of massachusetts amherst. this system incorporates recent work by metzler and croft [6]<1> which combines the language modeling and inference network approaches to information retrieval. by leveraging this work, indri supports many of the structured query operators from inquery [***]<1>. in addition, indri supports new operators for dealing with document fields, part of speech and named entity tagging, passage retrieval and numeric quantities influence:3 type:3 pair index:573 citer id:477 citer title:optimization strategies for complex queries citer abstract:previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. however, the literature for optimizing document-at-a-time systems remains sparse. we present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. like max score, our technique, called term bounded max score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. we simulated our technique to explore the problem space, then implemented it in indri, our large scale language modeling search engine. tests with the gov2 corpus on title queries show our method to be 23% faster than max score alone, and 61% faster than our document-at-a-time baseline. our optimized query times are competitive with conventional term-at-a-time systems on this year's trec terabyte task citee id:334 citee title:combining the language model and inference network approaches to retrieval citee abstract:the inference network retrieval model, as implemented in the inquery search engine, allows for richly structured queries. however, it incorporates a form of ad hoc tf.idf estimates for word probabilities. language modeling offers more formal estimation techniques. in this paper we combine the language modeling and inference network approaches into a single framework. the resulting model allows structured queries to be evaluated using language modeling estimates. we explore the issues involved, such as combining beliefs and smoothing of proximity nodes. experimental results are presented comparing the query likelihood model, the inquery system, and our new model. the results reaffirm that high quality structured queries outperform unstructured queries and show that our system consistently achieves higher average precision than inquery. surrounding text:motivation our research is motivated by the indri retrieval system, a new language modeling search engine developed at the university of massachusetts amherst. this system incorporates recent work by metzler and croft [***]<1> which combines the language modeling and inference network approaches to information retrieval. by leveraging this work, indri supports many of the structured query operators from inquery [4]<1>.
in addition, indri supports new operators for dealing with document fields, part of speech and named entity tagging, passage retrieval and numeric quantities influence:2 type:1 pair index:574 citer id:477 citer title:optimization strategies for complex queries citer abstract:previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. however, the literature for optimizing document-at-a-time systems remains sparse. we present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. like max score, our technique, called term bounded max score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. we simulated our technique to explore the problem space, then implemented it in indri, our large scale language modeling search engine. tests with the gov2 corpus on title queries show our method to be 23% faster than max score alone, and 61% faster than our document-at-a-time baseline. our optimized query times are competitive with conventional term-at-a-time systems on this year's trec terabyte task citee id:43 citee title:self-indexing inverted files for fast text retrieval citee abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5-10 terms, can reduce processing time to under one fifth of the previous cost. similarly, ranked queries of 40-50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness surrounding text:buckley suggests a method for doing this. moffat and zobel [***]<2> evaluate two heuristics, quit and continue, which reduce the time necessary to evaluate term-at-a-time queries. the quit heuristic dynamically adds accumulators while query processing continues, until the number of accumulators meets some fixed threshold influence:2 type:2 pair index:575 citer id:477 citer title:optimization strategies for complex queries citer abstract:previous research into the efficiency of text retrieval systems has dealt primarily with methods that consider inverted lists in sequence; these methods are known as term-at-a-time methods. however, the literature for optimizing document-at-a-time systems remains sparse. we present an improvement to the max score optimization, which is the most efficient known document-at-a-time scoring method. like max score, our technique, called term bounded max score, is guaranteed to return exactly the same scores and documents as an unoptimized evaluation, which is particularly useful for query model research. we simulated our technique to explore the problem space, then implemented it in indri, our large scale language modeling search engine.
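the quit and continue heuristics described above bound the number of accumulators: once the target is reached, quit stops processing the remaining (more frequent) terms altogether, while continue keeps updating documents that already hold an accumulator but creates no new ones. the sketch below captures that distinction in simplified form (the threshold is checked between lists rather than per posting); the function name and data layout are my own assumptions.

from collections import defaultdict

def limited_accumulators(query_postings, limit, strategy="continue"):
    # accumulator-limited term-at-a-time evaluation in the spirit of the
    # quit/continue heuristics; query_postings is {term: {docid: weight}}
    acc = defaultdict(float)
    for term in sorted(query_postings, key=lambda t: len(query_postings[t])):
        if len(acc) >= limit:
            if strategy == "quit":
                break                      # quit: ignore the remaining terms
            for doc, w in query_postings[term].items():
                if doc in acc:             # continue: update, but add no new docs
                    acc[doc] += w
        else:
            for doc, w in query_postings[term].items():
                acc[doc] += w
    return acc
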
tests with the gov2 corpus on title queries show our method to be 23% faster than max score alone, and 61% faster than our document-at-a-time baseline. our optimized query times are competitive with conventional term-at-a-time systems on this year's trec terabyte task citee id:343 citee title:managing gigabytes: compressing and indexing documents and images citee abstract:in this fully updated second edition of the highly acclaimed managing gigabytes, authors witten, moffat, and bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. it covers the latest developments in compression and indexing and their application on the web and in digital libraries. it also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the web. surrounding text:3. related work all modern retrieval systems use inverted lists to evaluate queries efficiently [***]<1>. the differences between the optimizations discussed here lie in how these inverted lists are processed influence:2 type:3 pair index:576 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:84 citee title:a fast match for continuous speech recognition using allophonic models citee abstract:in a large vocabulary real-time speech recognition system, there is a need for a fast method for selecting a list of candidate words from the vocabulary that match well with a given acoustic input. in this paper we describe a highly accurate fast acoustic match for continuous speech recognition. the algorithm uses allophonic models and efficient search techniques to select a set of candidate words. the allophonic models are derived by constructing decision trees that query the context in which each phone occurs to arrive at an allophone in a given context. the models for all the words in the vocabulary are arranged in a tree structure and efficient tree search algorithms are used to select a list of candidate words using these models. using this method we are able to obtain over 99% accuracy in the fast match for a continuous speech recognition task which has a vocabulary of 5,000 words.
surrounding text:such a two level evaluation approach is commonly used in other application areas to enhance performance: databases sometimes use bloom filters [4]<3> to construct a preliminary join followed by an exact evaluation [17]<3>. speech recognition systems use a fast but approximate acoustic match to decrease the number of extensions to be analyzed by a detailed match [***]<3>. and program committees often use a first rough cut to reduce the number of papers to be fully discussed in the committee influence:3 type:3 pair index:577 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:490 citee title:pivot join: a runtime operator for text search citee abstract:we consider the problem of efficiently sampling web search engine query results. in turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as: determining the set of categories in a given taxonomy spanned by the search results; finding the range of metadata values associated with the result set in order to enable multi-faceted search; estimating the size of the result set; and data mining associations to the query terms. we present and analyze an efficient algorithm for obtaining uniform random samples applicable to any search engine based on posting lists and document-at-a-time evaluation. (to our knowledge, all popular web search engines, e.g. google, inktomi, altavista, alltheweb, belong to this class.) furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for boolean and other complex queries is built from the next method for primitive terms. in our case we show how to construct a basic next(p) method that samples term posting lists with probability p, and show how to construct next(p) methods for boolean operators (and, or, wand) from primitive methods. finally, we test the efficiency and quality of our approach on both synthetic and real-world data. surrounding text:identifying an optimal selection strategy is outside the scope of this paper. recent ongoing research on this issue is described in [***]<3>.
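A primitive next(p) reader of the kind described in the abstract above can be sketched very simply: an iterator over a sorted posting list that keeps each entry independently with probability p. The class and method names are illustrative; building next(p) for composed operators (and, or, wand) from such primitives, as the cited work does, is not attempted here.

```python
import random

class SampledPostingReader:
    """Reads a sorted posting list, returning each posting with probability p."""

    def __init__(self, doc_ids, p, rng=None):
        self.doc_ids = doc_ids
        self.p = p
        self.rng = rng or random.Random()
        self.pos = 0

    def next(self):
        """Return the next sampled doc id, or None when the list is exhausted.
        (A production version would jump ahead by a geometric gap instead of
        tossing a coin for every posting.)"""
        while self.pos < len(self.doc_ids):
            doc_id = self.doc_ids[self.pos]
            self.pos += 1
            if self.rng.random() < self.p:
                return doc_id
        return None
```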
2 influence:3 type:3 pair index:578 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:491 citee title:space/time trade-offs in hash coding with allowable errors citee abstract:in this paper trade-offs among certain computational factors in hash coding are analyzed. the paradigm problem considered is that of testing a series of messages one-by-one for membership in a given set of messages. two new hash-coding methods are examined and compared with a particular conventional hash-coding method. the computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency. the new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. the reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods. in such applications, it is envisaged that overall performance could be improved by using a smaller core resident hash area in conjunction with the new methods and, when necessary, by using some secondary and perhaps time-consuming test to catch the small fraction of errors associated with the new methods. an example is discussed which illustrates possible areas of application for the new methods. analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time surrounding text:9% for long queries. such a two level evaluation approach is commonly used in other application areas to enhance performance: databases sometimes use bloom filters [***]<3> to construct a preliminary join followed by an exact evaluation [17]<3>. 
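The hash-coding scheme analyzed in the abstract above is what is now called a Bloom filter, and the surrounding text notes its use for pre-filtering a join before an exact evaluation. A tiny self-contained version follows; the array size, number of hashes, and hash construction are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: membership tests may return false positives,
    never false negatives, in exchange for a much smaller hash area."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one flag per position (a real filter would pack bits)

    def _positions(self, item):
        # derive num_hashes positions from salted SHA-1 digests
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))
```

In the semijoin setting mentioned above, one site would add its join keys to such a filter, ship the small array, and let the other site discard tuples whose keys are definitely absent before running the exact join.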
speech recognition systems use a fast but approximate acoustic match to decrease the number of extensions to be analyzed by a detailed match [2]<3> influence:2 type:3 pair index:579 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:492 citee title:the anatomy of a large-scale hypertextual web search engine citee abstract:in this paper, we present google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. the prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. to engineer a search engine is a challenging task. search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. they answer tens of millions of queries every day. despite the importance of large-scale search engines on the web, very little academic research has been done on them. furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. this paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. this paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. surrounding text:web search engines which must handle context-sensitive queries over very large collections indeed employ daat strategies. see, for instance, [8]<3> that describes altavista [1]<3>, and [***]<3> that describes google [11]<3>.
1 influence:3 type:3 pair index:580 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:493 citee title:optimization of inverted vector searches citee abstract:a simple algorithm is presented for increasing the efficiency of information retrieval searches which are implemented using inverted files. this optimization algorithm employs knowledge about the methods used for weighting document and query terms in order to examine as few inverted lists as possible. an extension to the basic algorithm allows greatly increased performance optimization at a modest cost in retrieval effectiveness. experimental runs are made examining several different term weighting models and showing the optimization possible with each. surrounding text:for a comprehensive overview of optimization techniques see turtle and flood [20]<0>. for taat strategies, the basic idea behind such optimization techniques is to process query terms in some order that lets the system identify the top n scoring documents without processing all query terms [***, 12, 18, 16, 21]<1>. an important optimization technique for daat strategies is termed max-score by turtle and flood
at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:494 citee title:object-oriented interface for an index citee abstract:a computer implemented method searches an index to locate records of a database using an object oriented interface. each record has a unique address in the database. the index is organized as a plurality of index entries where each index entry including a word and an ordered list of locations where the word occurs in the database. the words represent a unique piece of information of the database. the index entries are ordered first according to the collating order of the words, and second according to the collating order of the locations of each associated word. a query is parsed into terms and operators. each term is associated with a corresponding index entry, the operators relate the terms. a basic stream reader object is generated for each term of the query. the basic stream reader object sequentially reads the locations of the corresponding index entry to determine a target location. a compound stream reader object is generated for each operator. the compound stream reader object references the basic stream reader objects associated with the terms related by the operator. the compound stream reader object produces locations of words within a single record according to the operator. each basic and compound stream reader object is an encapsulation of data references and method references with operate on the data references surrounding text:posting lists are ordered in increasing order of the document identifiers. from a programming point of view, in order to support complex queries over such an inverted index it is convenient to take an object oriented approach [***, 10]<1>. each index term is associated with a basic iterator object (a stream reader object in the terminology of [***]<1>) capable of sequentially iterating over its posting list. from a programming point of view, in order to support complex queries over such an inverted index it is convenient to take an object oriented approach [***, 10]<1>. each index term is associated with a basic iterator object (a stream reader object in the terminology of [***]<1>) capable of sequentially iterating over its posting list. the iterator can additionally skip to a given entry in the posting list influence:2 type:1 pair index:582 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. 
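The surrounding text above describes each index term as exposing an iterator (a "stream reader") that can advance sequentially over its posting list and skip to a given entry. A minimal sketch of such an interface, with hypothetical names, might look as follows; a real index would back skip_to with skip pointers or an embedded index rather than a linear scan.

```python
class PostingIterator:
    """Iterator over one term's posting list (doc ids in increasing order)."""

    END = float("inf")   # sentinel doc id returned once the list is exhausted

    def __init__(self, doc_ids):
        self.doc_ids = doc_ids
        self.pos = 0

    def doc(self):
        """Current doc id, or END if the iterator is exhausted."""
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else self.END

    def next(self):
        """Advance to the next posting and return its doc id."""
        self.pos += 1
        return self.doc()

    def skip_to(self, target):
        """Advance to the first posting with doc id >= target."""
        while self.doc() < target:
            self.pos += 1
        return self.doc()
```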
at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:495 citee title:method for parsing, indexing and searching world-wide-web pages citee abstract:a system indexes web pages of the internet. the pages are stored in computers distributively connected to each other by a communications network. each page has a unique url (universal record locator). some of the pages can include url links to other pages. a communication interface connected to the internet is used for fetching a batch of web pages from the computers in accordance with the urls and url links. the urls are determined by an automated web browser connected to the communications interface. a parser sequentially partitions the batch of specified pages into indexable words where each word represents an indexable portion of information of a specific page, or the word represents an attribute of one or more portions of the specific page. the parser sequentially assigns locations to the words as they are parsed. the locations indicates the unique occurrences of the word in the web. the output of the parser is stored in a memory as an index. the index includes one index entry for each unique word. each index entry also includes one or more location entries indicating where the unique word occurs in the web. a query module parses a query into terms and operators. the operators relate the terms. a search engine uses object-oriented stream readers to sequentially read location of specified index entries, the specified index entries correspond to the terms of a query. a display module presents qualified pages located by the search engine to users of the web surrounding text:web search engines which must handle context-sensitive queries over very large collections indeed employ daat strategies. see, for instance, [***]<3> that describes altavista [1]<3>, and [5]<3> that describes google [11]<3>. 1 influence:2 type:3 pair index:583 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:496 citee title:juru at trec 10 experiments with index pruning citee abstract:this is the first year that juru, a java ir system developed over the past few years at the ibm research lab in haifa, participated in trecs web track. our experiments focused on the ad-hoc tasks. 
the main goal of our experiments was to validate a novel pruning method, first presented at , that significantly reduces the size of the index with very little influence on the systems precision. by reducing the index size, it becomes feasible to index large text collections such as the web tracks data on low-end machines. furthermore, using our method, web search engines can significantly decrease the burden of storing or backing up extremely large indices by discarding entries that have almost no influence on search results. surrounding text:) 1. 3 experimental results our experiments were done on the juru [***]<3> system, a java search engine developed at the ibm research lab in haifa. juru was used to index the wt10g collection of the trec webtrack [13]<3> and we measured precision and efficiency over a set of 50 webtrack topics. experimental results in this section we report results from experiments which we conducted to evaluate the proposed two-level query evaluation process. for our experiments we used juru [***]<3>, a java search engine, developed at the ibm research lab in haifa. we indexed the wt10g collection of the trec webtrack [13]<3> which contains 10gb of data consisting of 1 influence:2 type:3 pair index:584 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:497 citee title:shortest substring ranking citee abstract:to address the trec-4 topics, we used a precise query language that yields and combines arbitrary intervals of text rather than pre-defined units like words and documents. each solution was scored in inverse proportion to the length of the shortest interval containing it. each document was scored by the sum of the scores of solutions within it. whenever the above strategy yielded less than 1000 documents, documents satisfying successively weaker queries were added with lower rank. our results surrounding text:posting lists are ordered in increasing order of the document identifiers. from a programming point of view, in order to support complex queries over such an inverted index it is convenient to take an object oriented approach [7, ***]<1>. 
each index term is associated with a basic iterator object (a stream reader object in the terminology of [7]<1>) capable of sequentially iterating over its posting list influence:3 type:1 pair index:585 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:498 citee title:retrieving records from a gigabyte of text on a minicomputer using statistical ranking citee abstract:statistically based ranked retrieval of records using keywords provides many advantages over traditional boolean retrieval methods, especially for end users. this approach to retrieval, however, has not seen widespread use in large operational retrieval systems. to show the feasibility of this retrieval methodology, research was done to produce very fast search techniques using these ranking algorithms, and then to test the results against large databases with many end users. the results show not only response times on the order of 1 and 1/2 seconds for 806 megabytes of text, but also very favorable user reaction. novice users were able to consistently obtain good search results after 5 minutes of training. additional work was done to devise new indexing techniques to create inverted files for large databases using a minicomputer. these techniques use no sorting, require a working space of only about 20% of the size of the input text, and produce indices that are about 14% of the input text size. surrounding text:for a comprehensive overview of optimization techniques see turtle and flood [20]<0>. for taat strategies, the basic idea behind such optimization techniques is to process query terms in some order that lets the system identify the top n scoring documents without processing all query terms [6, ***, 18, 16, 21]<1>. an important optimization technique for daat strategies is termed max-score by turtle and flood influence:2 type:2 pair index:586 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed.
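The max-score technique named in the surrounding text above can be illustrated with a simplified sketch: per-term upper bounds are precomputed, and once the top-k heap is full a document is abandoned as soon as its partial score plus the remaining upper bounds cannot beat the current k-th score. This is only a rough rendition of the idea attributed to Turtle and Flood, not their exact formulation; all names and the data layout are assumptions.

```python
import heapq

def max_score_like(candidate_doc_ids, term_scores, term_upper_bounds, k):
    """Simplified max-score-style pruning over a document-at-a-time stream.

    candidate_doc_ids: iterable of doc ids, e.g. from a union of posting lists
    term_scores:       dict term -> {doc_id: partial score of that term}
    term_upper_bounds: dict term -> maximum partial score the term can contribute
    """
    terms = sorted(term_upper_bounds, key=term_upper_bounds.get, reverse=True)
    # suffix_bound[i] = largest possible contribution of terms[i:], used for early exit
    suffix_bound = [0.0] * (len(terms) + 1)
    for i in range(len(terms) - 1, -1, -1):
        suffix_bound[i] = suffix_bound[i + 1] + term_upper_bounds[terms[i]]

    heap, threshold = [], float("-inf")
    for doc_id in candidate_doc_ids:
        score, pruned = 0.0, False
        for i, term in enumerate(terms):
            if score + suffix_bound[i] <= threshold:
                pruned = True          # even a perfect remaining score cannot reach the top k
                break
            score += term_scores[term].get(doc_id, 0.0)
        if pruned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
        if len(heap) == k:
            threshold = heap[0][0]
    return sorted(heap, reverse=True)
```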
the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:499 citee title:term-ordered query evaluation versus document-ordered query evaluation for large document databases citee abstract:there are two main families of technique for efficient processing of ranked queries on large text collections: document-ordered processing and term-ordered processing. in this note we compare these techniques experimentally. we show that they have similar costs for short queries, but that for long queries document-ordered processing is much more costly. overall, we conclude that term-ordered processing, with the refinements of limited accumulators and hierarchical index structuring, is the more efficient mechanism. surrounding text:turtle and flood demonstrated experimentally that in an environment where it is not possible to store intermediate scores in main memory, optimized daat strategies outperform optimized taat strategies. in contrast, kaszkiel and zobel [***]<2> showed that for long queries, the max-score strategy is much more costly than optimized taat strategies, due to the required sorting of term postings at each stage of evaluation, a process that heavily depends on the number of query terms
we focus here on guru's indexing component which extracts conceptual attributes from natural language documentation. this indexing method is based on words' co-occurrences. it first uses extract, a co-occurrence knowledge compiler for extracting potential attributes from textual documents. conceptually relevant collocations are then selected according to their resolving power, which scales down the noise due to context words. this fully automated indexing tool thus goes further than keyword-based tools in the understanding of a document without the brittleness of knowledge based tools. the indexing component of guru is fully implemented, and some results are given in the paper surrounding text:. lexical affinities: lexical affinities (las) are terms found in close proximity to each other, in a window of small size [***]<3>. the posting iterator of an la term receives as input the posting iterators of both la terms and returns only documents containing both terms in close proximity influence:2 type:3 pair index:588 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:501 citee title:fast ranking in limited space citee abstract:ranking techniques have long been suggested as alternatives to more conventional boolean methods for searching document collections. the cost of computing a ranking is, however, greater than the cost of performing a boolean search, in terms of both memory space and processing time. here we consider the resources required by the cosine method of ranking, and show that with a careful application of indexing and selection techniques both the space and time required by ranking can be substantially surrounding text:for a comprehensive overview of optimization techniques see turtle and flood [20]<0>. for taat strategies, the basic idea behind such optimization techniques is to process query terms in some order that lets the system identify the top n scoring documents without processing all query terms [6, 12, 18, ***, 21]<1>.
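The lexical-affinity iterator described in the surrounding text above (return only documents in which the two terms occur within a small window) can be sketched on top of per-term position lists. The window size, function name, and data layout below are assumptions for illustration, not the cited engine's interface.

```python
def lexical_affinity_docs(positions_a, positions_b, window=5):
    """Yield doc ids in which the two terms co-occur within `window` word positions.

    positions_a / positions_b: dict doc_id -> sorted list of word positions of the term.
    """
    for doc_id in sorted(positions_a.keys() & positions_b.keys()):
        pa, pb = positions_a[doc_id], positions_b[doc_id]
        i = j = 0
        while i < len(pa) and j < len(pb):
            if abs(pa[i] - pb[j]) <= window:
                yield doc_id          # one close pair is enough to qualify the document
                break
            if pa[i] < pb[j]:
                i += 1
            else:
                j += 1
```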
an important optimization technique for daat strategies is termed max-score by turtle and flood influence:3 type:2 pair index:589 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall. at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:502 citee title:optimal semijoins for distributed database systems citee abstract:a bloom-filter-based semijoin algorithm for distributed database systems is presented. this algorithm reduces communications costs to process a distributed natural join as much as possible with a filter approach. an optimal filter is developed in pieces. filter information is used both to recognize when the semijoin will cease to be effective and to optimally process the semijoin. an ineffective semijoin will be quickly and cheaply recognized. an effective semijoin will use all of the transmitted bits optimally. surrounding text:9% for long queries. such a two level evaluation approach is commonly used in other application areas to enhance performance: databases sometimes use bloom filters [4]<3> to construct a preliminary join followed by an exact evaluation [***]<3>. speech recognition systems use a fast but approximate acoustic match to decrease the number of extensions to be analyzed by a detailed match [2]<3> influence:3 type:3 pair index:590 citer id:489 citer title:efficient query evaluation using a two-level retrieval process citer abstract:we present an efficient query evaluation method based on a two level approach: at the first level, our method iterates in parallel over query term postings and identifies candidate documents using an approximate evaluation taking into account only partial information on term occurrences and no query independent factors; at the second level, promising candidates are fully evaluated and their exact scores are computed. the efficiency of the evaluation process can be improved significantly using dynamic pruning techniques with very little cost in effectiveness. the amount of pruning can be controlled by the user as a function of time allocated for query evaluation. experimentally, using the trec web track data, we have determined that our algorithm significantly reduces the total number of full evaluations by more than 90%, almost without any loss in precision or recall.
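The wand (weak and) construct named in the abstract above is not spelled out there; the following is a common, generic rendition of the idea and not necessarily the authors' implementation. It assumes duck-typed posting iterators exposing doc(), next(), and skip_to() (see the iterator sketch earlier), per-term score upper bounds, and a caller-maintained threshold that would normally rise as the top-k heap fills.

```python
EXHAUSTED = float("inf")   # doc id reported by an exhausted iterator

def wand_candidates(iterators, upper_bounds, threshold):
    """Yield doc ids that pass the weak-AND test and deserve full evaluation.

    iterators:    posting iterators, one per query term (doc(), next(), skip_to())
    upper_bounds: per-term maximum score contributions, parallel to `iterators`
    threshold:    score a document must be able to exceed to be worth evaluating
    """
    order = list(range(len(iterators)))
    while True:
        order.sort(key=lambda t: iterators[t].doc())      # sort terms by current doc id
        # pivot: first term at which the accumulated upper bounds exceed the threshold
        acc, pivot = 0.0, None
        for t in order:
            acc += upper_bounds[t]
            if acc > threshold:
                pivot = t
                break
        if pivot is None or iterators[pivot].doc() == EXHAUSTED:
            return                                        # no remaining document can beat the threshold
        pivot_doc = iterators[pivot].doc()
        if iterators[order[0]].doc() == pivot_doc:
            yield pivot_doc                               # all preceding terms sit on pivot_doc
            for t in order:                               # move aligned terms past the candidate
                if iterators[t].doc() == pivot_doc:
                    iterators[t].next()
        else:
            iterators[order[0]].skip_to(pivot_doc)        # skip a lagging term up to the pivot
```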
at the heart of our approach there is an efficient implementation of a new boolean construct called wand or weak and that might be of independent interest citee id:503 citee title:overview of the tenth text retrieval conference (trec-10) citee abstract:trec-2001 saw the falling into abeyance of the large web task but a strengthening and broadening of activities based on the 1.69 million page wt10g corpus. there were two tasks. the topic relevance task was like traditional trec ad hoc but used queries taken from real web search logs from which description and narrative fields of a topic description were inferred by the topic developers. there were 50 topics. in the homepage finding task queries corresponded to the name of an entity whose home surrounding text:juru was used to index the wt10g collection of the trec webtrack [13]<3> and we measured precision and efficiency over a set of 50 webtrack topics. precision was measured by precision at 10 (p@10) and mean average precision (map) [***]<3>. the main method for estimating the efficiency of our approach is counting the number of full evaluations required to return a certain number of top results. search precision as measured by precision at 10 (p@10) and mean average precision (map) [***]<3>. influence:3 type:3 pair index:591 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:249 citee title:on saying "enough already!" in sql citee abstract:in this paper, we study a simple sql extension that enables query writers to explicitly limit the cardinality of a query result. we examine its impact on the query optimization and run-time execution components of a relational dbms, presenting two approaches - a conservative approach and an aggressive approach - to exploiting cardinality limits in relational query plans. results obtained from an empirical study conducted using db2 demonstrate the benefits of the sql extension and illustrate the tradeoffs between our two approaches to implementing it surrounding text:2.
over relational databases, carey and kossmann [***, 2]<2> presented techniques to optimize top-k queries when the scoring is done through a traditional sql order-by clause. donjerkovic and ramakrishnan [5]<2> proposed a probabilistic approach to top-k query optimization influence:2 type:2 pair index:592 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating.
a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:250 citee title:reducing the braking distance of an sql query engine citee abstract:in a recent paper, we proposed adding a stop after clause to sql to permit the cardinality of a query result to be explicitly limited by query writers and query tools. we demonstrated the usefulness of having this clause, showed how to extend a traditional cost-based query optimizer to accommodate it, and demonstrated via db2-based simulations that large performance gains are possible when stop after queries are explicitly supported by the database engine. in this paper, we present several new surrounding text:2. over relational databases, carey and kossmann [1, ***]<2> presented techniques to optimize top-k queries when the scoring is done through a traditional sql order-by clause. donjerkovic and ramakrishnan [5]<2> proposed a probabilistic approach to top-k query optimization influence:2 type:2 pair index:593 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating.
a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process topk queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:520 citee title:evaluating top-k selection queries citee abstract:in many applications, users specify target values for certain attributes, without requiring exact matches to these values in returnfi instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute valuesfi in this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that traditional relational dbmss can process e#cientlyfi in particular, we study how to determine a range query to surrounding text:if price and quality are more important to a given user than the location of the restaurant, then the query might assign, say, a 0:2 weight to attribute address, and a 0:4 weight to attributes price and rating. recent techniques to evaluate top-k queries over traditional relational dbmss [***, 5]<2> assume that all attributes of every object are readily available to the query processor, however, in many applications some attributes might not be available locally, but rather will have to be obtained from an external web-accessible source instead. for instance, the price attribute in our example is provided by the nytreview web site and can only be accessed by querying this sites web interface 5. donjerkovic and ramakrishnan [5]<2> proposed a probabilistic approach to top-k query optimization. finally, chaudhuri and gravano [***]<2> exploited multidimensional histograms to process top-k queries over an unmodified relational dbms by mapping top-k queries into traditional selection queries. additional related work includes the prefer system [11]<2>, which uses pre-materialized views to efficiently answer ranked preference queries over commercial dbmss influence:2 type:2 pair index:595 citer id:518 citer title:evaluating top-k queries overweb-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. 
a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:520 citee title:evaluating top-k selection queries citee abstract:in many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. in this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that traditional relational dbmss can process efficiently. in particular, we study how to determine a range query to surrounding text:if price and quality are more important to a given user than the location of the restaurant, then the query might assign, say, a 0.2 weight to attribute address, and a 0.4 weight to attributes price and rating. recent techniques to evaluate top-k queries over traditional relational dbmss [***, 5]<2> assume that all attributes of every object are readily available to the query processor; however, in many applications some attributes might not be available locally, but rather will have to be obtained from an external web-accessible source instead. for instance, the price attribute in our example is provided by the nytreview web site and can only be accessed by querying this site's web interface. donjerkovic and ramakrishnan [5]<2> proposed a probabilistic approach to top-k query optimization. finally, chaudhuri and gravano [***]<2> exploited multidimensional histograms to process top-k queries over an unmodified relational dbms by mapping top-k queries into traditional selection queries. additional related work includes the prefer system [11]<2>, which uses pre-materialized views to efficiently answer ranked preference queries over commercial dbmss influence:2 type:2 pair index:595 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating.
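The weighted combination in the surrounding text above (for instance, weight 0.2 on address proximity and 0.4 each on price and rating) is a simple monotone aggregation. The sketch below scores locally available attribute grades and keeps the best k; the attribute names and data layout are invented for illustration, and it deliberately ignores the paper's central complication that some attributes sit behind remote web interfaces.

```python
import heapq

def top_k_restaurants(restaurants, weights, k=10):
    """restaurants: dict name -> dict of attribute grades in [0, 1]
    weights:       dict attribute -> weight, e.g. {"address": 0.2, "price": 0.4, "rating": 0.4}
    Returns the k restaurant names with the highest weighted score."""
    def score(grades):
        return sum(weights[attr] * grades.get(attr, 0.0) for attr in weights)
    return heapq.nlargest(k, restaurants, key=lambda name: score(restaurants[name]))
```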
a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process topk queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:330 citee title:combining fuzzy information from multiple systems citee abstract:in a traditional database system, the result of a query is a set of values (those values that satisfy the query). in other data servers, such as a system with queries based on image content, or many text retrieval systems, the result of a query is a sorted list. for example, in the case of a system with queries based on image content, the query might ask for objects that are a particular shade of red, and the result of the query would be a sorted list of objects in the database, sorted by how ... surrounding text:) following fagin et al. [***, 7]<1>, we do not allow our algorithms to rely on wild guesses: thus a random access cannot zoom in on a previously unseen object, i$e$, on an object that has not been previously retrieved under sorted access from a source. therefore, an object will have to be retrieved from the s-source before being probed on any r-source influence:2 type:1 pair index:597 citer id:518 citer title:evaluating top-k queries overweb-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process topk queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. 
citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes. for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm (fagin's algorithm, or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm (the threshold algorithm, or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. surrounding text:hence, we interact with three autonomous sources and repeatedly query them for a potentially large set of candidate restaurants. recently, fagin et al. [***]<2> have presented query processing algorithms for top-k queries for the case where all intervening sources support sorted access (plus perhaps random access as well). unfortunately, these algorithms are not designed for sources that only support random access (e. fagin et al. [***]<2> presented instance optimal query processing algorithms over sources that are either of type sr-source (ta algorithm) or of type s-source (nra algorithm). as we will see, simple adaptations of these algorithms do not perform as well for the common scenario where r-source sources are also available. to process queries involving multiple multimedia attributes, fagin et al. proposed a family of algorithms [6, ***]<2>, developed as part of ibm almaden's garlic project. these algorithms can evaluate top-k queries that involve several independent multimedia subsystems, each producing scores that are combined using arbitrary monotonic aggregation functions influence:1 type:2 pair index:598 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query.
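The threshold algorithm (TA) described in the abstract above is compact enough to sketch directly: perform sorted access on every list in parallel, probe the remaining grades of each newly seen object by random access, and stop once the k-th best aggregate score is at least the threshold computed from the last grades seen under sorted access. This is a plain rendition of the published algorithm, not of the citer's adaptations; the data layout and the equal-length assumption are simplifications.

```python
import heapq

def threshold_algorithm(sorted_lists, grades, aggregate, k):
    """Plain TA over m attribute lists.

    sorted_lists: list of m lists of (object_id, grade), each sorted by grade descending;
                  for brevity all lists are assumed to have the same length.
    grades:       dict object_id -> list of its m grades (the random-access side)
    aggregate:    monotone aggregation function, e.g. min or a weighted average
    """
    heap, seen = [], set()
    for depth in range(len(sorted_lists[0])):
        row = [lst[depth] for lst in sorted_lists]        # one sorted access per list
        for object_id, _grade in row:
            if object_id in seen:
                continue
            seen.add(object_id)
            score = aggregate(grades[object_id])          # random accesses for the other grades
            if len(heap) < k:
                heapq.heappush(heap, (score, object_id))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, object_id))
        threshold = aggregate([g for _oid, g in row])     # best score any unseen object could reach
        if len(heap) == k and heap[0][0] >= threshold:
            break                                         # early stop: the top k can no longer change
    return sorted(heap, reverse=True)
```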
this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:522 citee title:wsq/dsq: a practical approach for combined querying of databases and the web citee abstract:we present wsq/dsq (pronounced "wisk-disk"), a new approach for combining the query facilities of traditional databases with existing search engines on the web. wsq, for web-supported (database) queries, leverages results from web searches to enhance sql queries over a relational database. dsq, for database-supported (web) queries, uses information stored in the database to enhance and explain web searches. this paper focuses primarily on wsq, describing a simple, low-overhead way to surrounding text:we approximate the popularity of a restaurant with the number of web pages that mention the restaurant, as reported by the altavista search engine. (the idea of using web search engines as a popularity oracle has been used before in the wsq/dsq system [***]<1>. ) table 1 summarizes these sources and their interfaces. ) another assumption is that only one source can be accessed at a time, which is too restrictive in the context of web sources. as explained in section 6, we can incorporate the ideas in [***]<1> to include parallelism and speed up query processing. acknowledgments this material is based upon work supported in part by the national science foundation under grants no influence:2 type:1,3 pair index:599 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query.
one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:465 citee title:optimizing multi-feature queries for image databases citee abstract:in digital libraries image retrieval queries can be based on the similarity of objects, using several feature attributes like shape, texture, color or text. surrounding text:nepal and ramakrishna [14]<2> and guntzer et al. [***]<2> presented variations of fagins original fa algorithm [6]<2> for processing queries over multimedia databases. in particular, guntzer et al. [***]<2> reduce the number of random accesses through the introduction of more stop-condition tests and by exploiting the data distribution. the mars system [15]<2> also uses variations of the fa algorithm and views queries as binary trees where the leaves are single-attribute queries and the internal nodes correspond to fuzzy query operators influence:2 type:2 pair index:600 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:523 citee title:predicate migration: optimizing queries with expensive predicates citee abstract:the traditional focus of relational query optimization schemes has been on the choice of join methods and join orders. restrictions have typically been handled in query optimizers by "predicate pushdown" rules, which apply restrictions in some random order before as many joins as possible. these rules work under the assumption that restriction is essentially a zero-time operation.
however, today's extensible and object-oriented database systems allow users to define time-consuming functions, which may be used in a query's restriction and join predicates. furthermore, sql has long supported subquery predicates, which may be arbitrarily time-consuming to check. thus restrictions should not be considered zero-time operations, and the model of query optimization must be enhanced. in this paper we develop a theory for moving expensive predicates in a query plan so that the total cost of the plan--including the costs of both joins and restrictions--is minimal. we present an algorithm to implement the theory, as well as results of our implementation in postgres. in our experience, plans produced by the newly enhanced postgres are orders of magnitude faster than plans generated by a traditional query optimizer. the additional complexity of considering expensive predicates during optimization is found to be manageably small. surrounding text:3.2 exploiting techniques for processing selections with expensive predicates work on expensive-predicate query optimization [***, 12]<1> has studied how to process selection queries of the form p1 ∧ ... ∧ pn, where each predicate pi can be expensive to calculate. the key idea is to order the evaluation of predicates to minimize the expected execution time influence:2 type:1 pair index:601 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:524 citee title:prefer: a system for the efficient execution of multi-parametric ranked queries citee abstract:users often need to optimize the selection of objects by appropriately weighting the importance of multiple object attributes. such optimization problems appear often in operations research and applied mathematics as well as everyday life; e.g., a buyer may select a home as a weighted function of a number of attributes like its distance from office, its price, its area, etc. we capture such queries in our definition of preference queries that use a weight function over a relation's attributes to derive a score for each tuple.
database systems cannot efficiently produce the top results of a preference query because they need to evaluate the weight function over all tuples of the relation. prefer answers preference queries efficiently by using materialized views that have been preprocessed and stored. we first show how the result of a preference query can be produced in a pipelined fashion using a materialized view. then we show that excellent performance can be delivered given a reasonable number of materialized views and we provide an algorithm that selects a number of views to precompute and materialize given space constraints. we have implemented the algorithms proposed in this paper in a prototype system called prefer, which operates on top of a commercial database management system. we present the results of a performance comparison, comparing our algorithms with prior approaches using synthetic datasets. our results indicate that the proposed algorithms are superior in performance compared to other approaches, both in preprocessing (preparation of materialized views) as well as execution time. surrounding text:finally, chaudhuri and gravano [4]<2> exploited multidimensional histograms to process top-k queries over an unmodified relational dbms by mapping top-k queries into traditional selection queries. additional related work includes the prefer system [***]<2>, which uses pre-materialized views to efficiently answer ranked preference queries over commercial dbmss. recently, natsev et al influence:3 type:2 pair index:602 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data.
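The preference-query model in the prefer abstract above boils down to scoring each tuple with a weighted sum of its attributes. The sketch below shows only the naive evaluation that prefer's materialized ranked views are designed to avoid; the relation, attribute names, and weights are invented for illustration.

```python
def top_k_preference(tuples, weights, k):
    """Naive preference query: score every tuple as a weighted sum of its attributes
    and keep the k highest-scoring tuples (a full scan of the relation)."""
    scored = [(sum(w * t[attr] for attr, w in weights.items()), t) for t in tuples]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

homes = [{"distance": 0.9, "price": 0.4, "area": 0.7},
         {"distance": 0.5, "price": 0.8, "area": 0.6}]
print(top_k_preference(homes, {"distance": 0.5, "price": 0.3, "area": 0.2}, k=1))
```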
citee id:525 citee title:optimizing disjunctive queries with expensive predicates citee abstract:in this work, we propose and assess a technique called bypass processing for optimizing the evaluation of disjunctive queries with expensive predicates. the technique is particularly useful for optimizing selection predicates that contain terms whose evaluation costs vary tremendously; e.g., the evaluation of a nested subquery or the invocation of a user-defined function in an object-oriented or extended relational model may be orders of magnitude more expensive than an attribute access (and surrounding text:3.2 exploiting techniques for processing selections with expensive predicates work on expensive-predicate query optimization [10, ***]<1> has studied how to process selection queries of the form p1 ∧ ... ∧ pn, where each predicate pi can be expensive to calculate. the key idea is to order the evaluation of predicates to minimize the expected execution time influence:2 type:1 pair index:603 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:526 citee title:incremental join queries on ranked inputs citee abstract:this paper investigates the problem of incremental joins of multiple ranked data sets when the join condition is a list of arbitrary user-defined predicates on the input tuples. this problem arises in many important applications dealing with ordered inputs and multiple ranked data sets, and requiring the top k solutions. we use multimedia applications as the motivating examples but the problem is equally applicable to traditional database applications involving optimal resource allocation, scheduling, decision making, ranking, etc. we propose an algorithm, j*, that enables querying of ordered data sets by imposing arbitrary user-defined join predicates. the basic version of the algorithm does not use any random access but a j*pa variation can exploit available indexes for efficient random access based on the join predicates.
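The ordering idea in the surrounding text above (evaluate the conjuncts p1 ∧ ... ∧ pn so that expected execution time is minimized) is commonly realized by ranking each predicate by (selectivity - 1) / cost-per-tuple and applying predicates in ascending rank order. Below is a sketch under the assumption of independent predicates with known costs and selectivities; the example numbers are invented.

```python
def order_predicates(preds):
    """preds: list of (name, cost_per_tuple, selectivity). Returns the evaluation order
    for a conjunction that minimizes expected per-tuple cost, assuming independence:
    ascending rank = (selectivity - 1) / cost."""
    return sorted(preds, key=lambda p: (p[2] - 1.0) / p[1])

def expected_cost(order):
    """Expected per-tuple cost when predicates are applied in the given order."""
    cost, surviving = 0.0, 1.0
    for _name, c, s in order:
        cost += surviving * c      # a predicate is only evaluated on tuples that survived so far
        surviving *= s
    return cost

preds = [("cheap_unselective", 1.0, 0.9), ("expensive_selective", 50.0, 0.1)]
print(expected_cost(order_predicates(preds)))                    # 1.0 + 0.9*50 = 46.0
print(expected_cost(list(reversed(order_predicates(preds)))))    # 50.0 + 0.1*1 = 50.1
```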
a special case includes the join scenario considered by fagin for joins based on identical keys, and in that case, our algorithms perform as efficiently as fagins. our main contribution, however, is the generalization to join scenarios that were previously unsupported, including cases where random access in the algorithm is not possible due to lack of unique keys. in addition, j* can support multiple join levels, or nested join hierarchies, which are the norm for modeling multimedia data. we also give ε-approximation versions of both of the above algorithms. finally, we give strong optimality results for some of the proposed algorithms, and we study their performance empirically surrounding text:recently, natsev et al. proposed incremental algorithms [***]<2> to compute top-k queries with user-defined join predicates over sorted-access sources. finally, the wsq/dsq project [8]<2> presented an architecture for integrating web-accessible search engines with relational dbmss influence:3 type:2 pair index:604 citer id:518 citer title:evaluating top-k queries over web-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:527 citee title:query processing issues in image (multimedia) databases citee abstract:multimedia databases have attracted academic and industrial interest, and systems such as qbic (content based image retrieval system from ibm) have been released. such systems are essential to effectively and efficiently use the existing large collections of image data in the modern computing environment. the aim of such systems is to enable retrieval of images based on their contents. this problem has brought together the (decades old) database and image processing communities. as part of our surrounding text:s algorithms to our scenario and experimentally compared the resulting techniques with our new approach in section 5. nepal and ramakrishna [***]<2> and guntzer et al.
[9]<2> presented variations of fagins original fa algorithm [6]<2> for processing queries over multimedia databases influence:2 type:2 pair index:605 citer id:518 citer title:evaluating top-k queries overweb-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process topk queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:469 citee title:supporting ranked boolean similarity queries in mars citee abstract:to address the emerging needs of applications that require access to and retrieval of multimedia objects, weare developing the multimedia analysis and retrieval system (mars) fi in this paper, we concentrate onthe retrieval subsystem of mars and its support for content-based queries over image databasesfi content-basedretrieval techniques have been extensively studied for textual documents in the area of automatic informationretrieval fi this paper describes how these techniques surrounding text:[9]<2> reduce the number of random accesses through the introduction of more stop-condition tests and by exploiting the data distribution. the mars system [***]<2> also uses variations of the fa algorithm and views queries as binary trees where the leaves are single-attribute queries and the internal nodes correspond to fuzzy query operators. chaudhuri and gravano also built on fagins original fa algorithm and proposed a cost-based approach for optimizing the execution of top-k queries over multimedia repositories [3]<2> influence:3 type:2 pair index:606 citer id:518 citer title:evaluating top-k queries overweb-accessible databases with global page ordering citer abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. 
a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process topk queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. citee id:528 citee title:numerical recipes in c: the art of scientific computing citee abstract:this book, like its predecessor edition, is supposed to teach you methods of numerical computing that are practical, efficient, and (insofar as possible) elegant. we presume throughout this book that you, the reader, have particular tasks that you want to get done. we view our job as educating you on how to proceed. occasionally we may try to reroute you briefly onto a particularly beautiful side road; but by and large, we will guide you along main highways that lead to practical destinations. throughout this book, you will find us fearlessly editorializing, telling you what you should and shouldnt do. this prescriptive tone results from a conscious decision on our part, and we hope that you will not find it irritating. we do not claim that our advice is infallible! rather, we are reacting against a tendency, in the textbook literature of computation, to discuss every possible method that has ever been invented, without ever offering a practical judgment on relative merit. we do, therefore, offer you our practical judgments whenever we can. as you gain experience, you will form your own opinion of how reliable our advice is. we presume that you are able to read computer programs in c, that being the language of this version of numerical recipes (second edition). the book numerical recipes in fortran (second edition) is separately available, if you prefer to program in that language. earlier editions of numerical recipes in pascal and numerical recipes routines and examples in basic are also available; while not containing the additional material of the second edition versions in c and fortran, these versions are perfectly serviceable if pascal or basic is your language of choice. surrounding text:in contrast, when cf<0, the s-source scores are negatively correlated with the r-source scores.  gaussian data set: we generate the multiattribute score distribution by producing five overlapping multidimensional gaussian bells [***]<3>. the random-access cost for each r-source ri (i influence:3 type:3 pair index:607 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research e?ort has been devoted to opinion mining on product reviews. recently, some research e?ort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. 
these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may loss some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to di?erent product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:529 citee title:evaluating wordnetbased measures of lexical semantic relatedness citee abstract:the quantification of lexical semantic relatedness has many applications in nlp, and many different measures have been proposed. we evaluate five of these measures, all of which use wordnet as their central resource, by comparing their performance in detecting and correcting real-word spelling errors. an information-content-based measure proposed by jiang and conrath is found superior to those proposed by hirst and st-onge, leacock and chodorow, lin, and resnik. in addition, we explain why distributional similarity is not an adequate proxy for lexical semantic relatedness. surrounding text:generally speaking, the noun network is richly developed in most of electronic lexicon like wordnet. comparing with nouns, researches on semantic relatedness using wordnet performed far worse for words with other part-ofspeeches [***]<3>. and in our research, the extracted product feature words are included in the set of nouns and noun phrases (see section 4 influence:3 type:3 pair index:608 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research e?ort has been devoted to opinion mining on product reviews. recently, some research e?ort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may loss some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. 
moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:539 citee title:noun phrase coreference as clustering citee abstract:this paper introduces a new, unsupervised algorithm for noun phrase coreference resolution. it differs from existing methods in that it views coreference resolution as a clustering task. in an evaluation on the muc-6 coreference resolution corpus, the algorithm achieves an f-measure of 53.6%, placing it firmly between the worst (40%) and best (65%) systems in the muc-6 evaluation. more importantly, the clustering approach outperforms the only muc-6 system to treat coreference resolution as a learning problem. the clustering algorithm appears to provide a flexible mechanism for coordinating the application of context-independent and context-dependent constraints and preferences for accurate partitioning of noun phrases into coreference equivalence classes. surrounding text:we use the noun part to provide semantic class constraints. the performance of product feature categorization is evaluated using the measure of rand index [***, 18]<1>. in equation 7, p1 and p2 respectively represent the partition of an algorithm and manual labeling influence:3 type:3 pair index:609 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:401 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task.
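The rand index mentioned in the surrounding text above (used to compare the produced product-feature partition p1 against the manually labeled partition p2) is the fraction of item pairs on which the two partitions agree about same-cluster versus different-cluster membership. A small sketch with made-up labels:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index of two partitions given as per-item cluster labels: the fraction of
    item pairs placed the same way (together or apart) by both partitions."""
    n = len(labels_a)
    assert n == len(labels_b)
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in combinations(range(n), 2))
    return agree / (n * (n - 1) // 2)

print(rand_index(["cam", "cam", "lens"], ["c1", "c2", "c2"]))  # 1 of 3 pairs agree -> 0.333...
```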
ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:mutual reinforcement, clustering 1. introduction opinion mining attracts extensive researches in recent years[***, 9, 12, 17]<2>. based on a collection of customer reviews, the task is to extract customers' opinions and identify their sentiment orientation influence:1 type:2 pair index:610 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research e?ort has been devoted to opinion mining on product reviews. recently, some research e?ort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may loss some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to di?erent product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:186 citee title:a system for summarizing and visualizing arguments in subjective documents: toward supporting decision making citee abstract:on the world wide web, the volume of subjective information, such as opinions and reviews, has been increasing rapidly. the trends and rules latent in a large set of subjective descriptions can potentially be useful for decision-making purposes. in this paper, we propose a method for summarizing subjective descriptions, specifically opinions in japanese. we visualize the pro and con arguments for a target topic, such as should japan introduce the summertime system? 
users can summarize the arguments about the topic in order to choose a more reasonable standpoint for decision making. we evaluate our system, called opinionreader, experimentally. surrounding text:in the research of[7, 8]<2>, they proposed that other components of a sentence are unlikely to be product features except for nouns and noun phrases. in the paper of [***]<2>, they targeted nouns, noun phrases and verb phrases. the adding of verb phrases caused the identication of more possible product features, while brought lots of noises influence:3 type:2 pair index:611 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research e?ort has been devoted to opinion mining on product reviews. recently, some research e?ort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may loss some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to di?erent product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:295 citee title:chinese named entity recognition based on multilevel linguistic features citee abstract:this paper presents a chinese named entity recognition system that employs the robust risk minimization (rrm) classification method and incorporates the advantages of character-based and word-based models. from experiments on a large-scale corpus, we show that significant performance enhancements can be obtained by integrating various linguistic information (such as chinese word segmentation, semantic types, part of speech, and named entity triggers) into a basic chinese character based model. a novel feature weighting mechanism is also employed to obtain more useful cues from most important linguistic features. moreover, to overcome the limitation of computational resources in building a high-quality named entity recognition system from a large-scale corpus, informative samples are selected by an active learning approach surrounding text:named entity recognition (ner) is utilized in the process. we use a ner system developed by ibm[***]<1>. 
the system can recognize four types of nes: person (per), location (loc), organization (org), and miscellaneous ne (misc) that does not belong to the previous three groups (e influence:3 type:3 pair index:612 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research e?ort has been devoted to opinion mining on product reviews. recently, some research e?ort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may loss some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to di?erent product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. 
our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:however, for many applications, simply judging the sentiment orientation of a review unit is not sufficient. researchers [***, 8, 9, 14]<2> began to work on feature-oriented opinion mining which evaluates sentiment orientation relating to different review features. while not so much work has been done on feature level opinion mining. liu [9]<2> and hu [***, 8]<2>'s works may be the most representative researches in this area. the appearance of implicit product feature was first shown in their papers. so most of the existing researches take adjectives as opinion words. in the research of [***, 8]<2>, they proposed that other components of a sentence are unlikely to be product features except for nouns and noun phrases. in the paper of [4]<2>, they targeted nouns, noun phrases and verb phrases influence:1 type:2 pair index:613 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:540 citee title:mining opinion features in customer reviews citee abstract:it is a common practice that merchants selling products on the web ask their customers to review the products and associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds. this makes it difficult for a potential customer to read them in order to make a decision on whether to buy the product. in this project, we aim to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we are only interested in the specific features of the product that customers have opinions on and also whether the opinions are positive or negative.
we do not summarize the reviews by selecting or rewriting a subset of the original sentences from the reviews to capture their main points as in the classic text summarization. in this paper, we only focus on mining opinion/product features that the reviewers have commented on. a number of techniques are presented to mine such features. our experimental results show that these techniques are highly effective. surrounding text:however, for many applications, simply judging the sentiment orientation of a review unit is not sucient. researchers[7, ***, 9, 14]<2> began to work on feature-oriented opinion mining which evaluates sentiment orientation relating to di. erent review features. while not so much work has been done on feature level opinion mining. liu[9]<2>and hu[7, ***]<2>'s works may be the most representative researches in this area. the appearance of implicit product feature was rst showed in their papers. so most of the existing researches take adjectives as opinion words. in the research of[7, ***]<2>, they proposed that other components of a sentence are unlikely to be product features except for nouns and noun phrases. in the paper of [4]<2>, they targeted nouns, noun phrases and verb phrases influence:1 type:2 pair index:614 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research e?ort has been devoted to opinion mining on product reviews. recently, some research e?ort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may loss some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to di?erent product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. 
the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly. surrounding text:mutual reinforcement, clustering 1. introduction opinion mining attracts extensive researches in recent years [3, ***, 12, 17]<2>. based on a collection of customer reviews, the task is to extract customers' opinions and identify their sentiment orientation. however, for many applications, simply judging the sentiment orientation of a review unit is not sufficient. researchers [7, 8, ***, 14]<2> began to work on feature-oriented opinion mining which evaluates sentiment orientation relating to different review features. while not so much work has been done on feature level opinion mining. liu [***]<2> and hu [7, 8]<2>'s works may be the most representative researches in this area. the appearance of implicit product feature was first shown in their papers. our definition of implicit product feature is a little different from the definition in [***]<2>. in the paper, they gave an example to show the implicit product feature in a digital camera review: "included 16mb is stingy". words like "16mb" are treated as clustering objects to build product feature categories. the association rule mining approach in [***]<2> did a good job in identifying product features, but it can not deal with the implicit feature identification effectively. they deal with the problems by the synonym set in wordnet and the semi-automated tagging of reviews. our approach groups product feature words (including words which are considered as product feature values in [***]<2>) into categories. it's an unsupervised method and easy to be adapted to new domains influence:1 type:2 pair index:615 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information.
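The explicit-adjacency strategy that the citer's approach moves beyond (nouns and noun phrases as candidate product features, nearby adjectives as opinion words) can be illustrated in a few lines; the Penn-style POS tags, the window size, and the example sentence below are assumptions for illustration, not the citer's actual extraction rules.

```python
def adjacent_pairs(tagged_sentence, window=2):
    """Explicit-adjacency baseline: pair each noun (candidate product feature) with any
    adjective (candidate opinion word) within `window` tokens of it."""
    pairs = []
    for i, (tok, pos) in enumerate(tagged_sentence):
        if pos.startswith("NN"):
            lo, hi = max(0, i - window), min(len(tagged_sentence), i + window + 1)
            pairs.extend((tok, o_tok) for o_tok, o_pos in tagged_sentence[lo:hi]
                         if o_pos.startswith("JJ"))
    return pairs

print(adjacent_pairs([("the", "DT"), ("battery", "NN"), ("life", "NN"),
                      ("is", "VBZ"), ("short", "JJ")]))
# [('life', 'short')] -- 'battery' falls outside the 2-token window, one of the gaps
# that a sentiment-link / clustering approach is meant to recover
```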
2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multi-source is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms the state-of-art algorithms citee id:541 citee title:some methods for classification and analysis of multivariate observations citee abstract:this paper describes a number of applications of the 'k-means', a procedure for classifying a random sample of points in e_n. the procedure consists of starting with k groups which each consist of a single random point, and thereafter adding the points one after another to the group whose mean each point is nearest. after a point is added to a group, the mean of that group is adjusted so as to take account of the new point. thus at each stage there are in fact k means, one for each group. after the sample is processed in this way, the points are classified on the basis of nearness to the final means. the partitions which result tend to be efficient in the sense of having low within class variance. applications are suggested for the problems of non-linear prediction, efficient communication, non-parametric tests of independence, similarity grouping, and automatic file construction. the extension of the methods to general metric spaces is indicated surrounding text:3.3 product feature category optimization based on semantic and textual structural knowledge in the process of mutual reinforcement, any traditional clustering algorithm can be easily embedded into the iterative process, such as the k-means algorithm [***]<1> and other state-of-art algorithms. take the plain k-means algorithm as an example: it is an unsupervised learning method based on iterative relocation that partitions a dataset into k clusters of similar datapoints, typically by minimizing an objective function of average squared distance influence:3 type:3 pair index:616 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to the feature-level opinion mining instead of the whole review level by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment association between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specially, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links.
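The plain k-means procedure referenced in the surrounding text above (iterative relocation that minimizes average squared distance to cluster centroids) in a minimal NumPy sketch; random-sample initialization and the fixed iteration cap are illustrative choices rather than details from the cited paper.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest centroid, recompute centroids,
    and repeat until the centroids stop moving (or the iteration cap is hit)."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                    # nearest-centroid assignment
        new_centroids = np.array([points[labels == c].mean(axis=0)
                                  if np.any(labels == c) else centroids[c]
                                  for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

pts = np.vstack([np.random.randn(50, 2) + 4, np.random.randn(50, 2) - 4])
labels, centroids = kmeans(pts, k=2)
```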
moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:117 citee title:seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales citee abstract:we address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). this task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". we first evaluate human performance at the task. then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. we show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of svms when we employ a novel similarity measure appropriate to the problem. surrounding text:mutual reinforcement, clustering 1. introduction opinion mining has attracted extensive research in recent years [3, 9, ***, 17]<2>. based on a collection of customer reviews, the task is to extract customers' opinions and identify their sentiment orientation influence:2 type:2 pair index:617 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links.
the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:439 citee title:discovering word senses from text citee abstract:inventories of manually compiled dictionaries usually serve as a source for word senses. however, they often include many rare senses while missing corpus/domain-specific senses. we present a clustering algorithm called cbc (clustering by committee) that automatically discovers word senses from text. it initially discovers a set of tight clusters called committees that are well scattered in the similarity space. the centroid of the members of a committee is used as the feature vector of the cluster. we proceed by assigning words to their most similar clusters. after assigning an element to a cluster, we remove their overlapping features from the element. this allows cbc to discover the less frequent senses of a word and to avoid discovering duplicate senses. each cluster that a word belongs to represents one of its senses. we also present an evaluation methodology for automatically measuring the precision and recall of discovered senses. surrounding text:w2) denotes the co-occurrence frequency of w1 and w2 within a phrase's range. although the mutual information weight is biased towards infrequent words [***]<3>, it can utilize more relatedness and restriction than other weight settings such as an instance's document frequency (df), etc. so we represent the instances by the pmi weight in this research influence:3 type:3 pair index:618 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:119 citee title:extracting product features and opinions from reviews citee abstract:consumers are often forced to wade through many on-line reviews in order to make an informed product choice.
this paper introduces opine, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products. compared to previous work, opine achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. opine's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity. surrounding text:however, for many applications, simply judging the sentiment orientation of a review unit is not sufficient. researchers [7, 8, 9, ***]<2> began to work on feature-oriented opinion mining, which evaluates sentiment orientation relating to different review features influence:1 type:2 pair index:619 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:392 citee title:cws: a comparative web search system citee abstract:in this paper, we define and study a novel search problem: comparative web search (cws). the task of cws is to seek relevant and comparative information from the web to help users conduct comparisons among a set of topics. a system called cws is developed to effectively facilitate web users' comparison needs. given a set of queries, which represent the topics that a user wants to compare, the system is characterized by: (1) automatic retrieval and ranking of web pages by incorporating both their relevance to the queries and the comparative contents they contain; (2) automatic clustering of the comparative contents into semantically meaningful themes; (3) extraction of representative keyphrases to summarize the commonness and differences of the comparative contents in each theme.
we developed a novel interface which supports two types of view modes: a pair-view, which displays the results at the page level, and a cluster-view, which organizes the comparative pages into the themes and displays the extracted phrases to facilitate users' comparison. experiment results show the cws system is effective and efficient. surrounding text:opinion mining is supposed to have many potential applications. for web search, it can be integrated into search engines to satisfy users' searching goals on opinion, such as comparative web search (cws) [***]<3> and opinion question answering [20]<3>. the task of opinion mining has usually been approached as a classification of either positive or negative on a review or its snippet influence:2 type:3 pair index:620 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:18 citee title:elements of information theory citee abstract:all the essential topics in information theory are covered in detail, including entropy, data compression, channel capacity, rate distortion, network information theory, and hypothesis testing. the authors provide readers with a solid understanding of the underlying theory and applications. problem sets and a telegraphic summary at the end of each chapter further assist readers. the historical notes that follow each chapter recap the main points. surrounding text:let w1 and w2 be two words or phrases. the pointwise mutual information [***]<1> between w1 and w2 is defined as: pmi(w1, w2) = log( p(w1 & w2) / (p(w1) p(w2)) ) influence:3 type:1 pair index:621 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context.
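the pointwise mutual information weight quoted in the surrounding text above can be computed from co-occurrence counts roughly as follows; this is a minimal sketch assuming counts gathered within phrase-sized windows, and the counts and names are illustrative, not taken from the paper.

import math

def pmi(count_both, count_w1, count_w2, total_windows):
    """pmi(w1, w2) = log( p(w1 & w2) / (p(w1) * p(w2)) ), with probabilities
    estimated from counts over phrase-sized windows."""
    if count_both == 0:
        return float("-inf")  # the pair never co-occurs within a phrase's range
    p_both = count_both / total_windows
    p_w1 = count_w1 / total_windows
    p_w2 = count_w2 / total_windows
    return math.log(p_both / (p_w1 * p_w2))

# illustrative counts: a feature word and an opinion word co-occur in 30 of 10000 windows
print(pmi(30, 120, 200, 10000))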
these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:407 citee title:thumbs up or thumbs down? semantic orientation applied to unsupervised classification of reviews citee abstract:this paper presents a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). the classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. a phrase has a positive semantic orientation when it has good associations (e.g., subtle nuances) and a negative semantic orientation when it has bad associations (e.g., very cavalier). in this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word excellent minus the mutual information between the given phrase and the word poor. a review is classified as recommended if the average semantic orientation of its phrases is positive. the algorithm achieves an average accuracy of 74% when evaluated on 410 reviews from epinions, sampled from four different domains (reviews of automobiles, banks, movies, and travel destinations). the accuracy ranges from 84% for automobile reviews to 66% for movie reviews surrounding text:mutual reinforcement, clustering 1. introduction opinion mining has attracted extensive research in recent years [3, 9, 12, ***]<2>. based on a collection of customer reviews, the task is to extract customers' opinions and identify their sentiment orientation influence:1 type:2 pair index:622 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem.
more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:348 citee title:constrained k-means clustering with background knowledge citee abstract:clustering is traditionally viewed as an unsupervised method for data analysis. however, in some cases information about the problem domain is available in addition to the data instances themselves. in this paper, we demonstrate how the popular k-means clustering algorithm can be profitably modified to make use of this information. in experiments with artificial constraints on six data sets, we observe improvements in clustering accuracy. we also apply this method to the real-world problem of automatically detecting road lanes from gps data and observe dramatic increases in performance. surrounding text:our basic idea of clustering enhancement by background knowledge comes from cop-kmeans. cop-kmeans [***]<1> is a semi-supervised variant of k-means. background knowledge, provided in the form of constraints between data objects, is used to generate the partition in the clustering process. we use the noun part to provide semantic class constraints. the performance of product feature categorization is evaluated using the rand index measure [2, ***]<1>. in equation 7, p1 and p2 respectively represent the partition produced by an algorithm and the manual labeling influence:2 type:1 pair index:623 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure.
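the rand index mentioned in the surrounding text above (comparing an algorithm's partition p1 with a manual labeling p2) can be computed by pair counting; the sketch below uses the standard pair-counting definition rather than the paper's exact equation 7, which is not reproduced in the excerpt.

from itertools import combinations

def rand_index(p1, p2):
    """fraction of object pairs on which two partitions agree
    (both together in each partition, or both apart in each partition)."""
    assert len(p1) == len(p2)
    pairs = list(combinations(range(len(p1)), 2))
    agree = sum((p1[i] == p1[j]) == (p2[i] == p2[j]) for i, j in pairs)
    return agree / len(pairs)

# example: two partitions of five feature words
print(rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))  # 0.8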
based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:542 citee title:reinforcing web-object categorization through interrelationships citee abstract:existing categorization algorithms deal with homogeneous web objects, and consider interrelated objects as additional features when taking the interrelationships with other types of objects into account. however, focusing on any single aspect of the inter-object relationship is not sufficient to fully reveal the true categories of web objects. in this paper, we propose a novel categorization algorithm, called the iterative reinforcement categorization algorithm (irc), to exploit the full interrelationship between different types of web objects on the web, including web pages and queries. irc classifies the interrelated web objects by iteratively reinforcing the individual classification results of different types of objects via their interrelationship. experiments on a clickthrough-log dataset from the msn search engine show that, in terms of the f1 measure, irc achieves a 26.4% improvement over a pure content-based classification method. it also achieves a 21% improvement over a query-metadata-based method, as well as a 16.4% improvement on f1 measure over the well-known virtual document-based method. our experiments show that irc converges fast enough to be applicable to real world applications. surrounding text:our approach associates product feature categories and opinion word groups by their interrelationship. the idea of mutual reinforcement for multi-type interrelated data objects is utilized in some applications, such as web mining and collaborative filtering [***]<3>. we develop the idea to identify the association between product feature categories and opinion word groups, and simultaneously enhance clustering under a unified framework influence:3 type:3 pair index:624 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links.
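the mutual reinforcement idea described above (two interrelated object types whose clustering is reinforced through their link structure) can be sketched schematically as below; the mixing weight alpha, the row normalization, and the update form are illustrative assumptions, not the authors' exact update rule.

import numpy as np

def reinforce_step(sim_features, sim_opinions, links, alpha=0.7):
    """one reinforcement step: blend each side's intra-type content similarity
    with similarity propagated through the feature-opinion link matrix."""
    # row-normalize links so propagated similarity stays on a comparable scale
    l = links / np.maximum(links.sum(axis=1, keepdims=True), 1e-12)
    new_sim_features = alpha * sim_features + (1 - alpha) * l @ sim_opinions @ l.T
    new_sim_opinions = alpha * sim_opinions + (1 - alpha) * l.T @ sim_features @ l
    return new_sim_features, new_sim_opinions

each side could then be re-clustered on its updated similarities (for instance with a k-means-style procedure such as the sketch given earlier) and the loop repeated until the clusters stabilize.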
based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation. the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:123 citee title:towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences citee abstract:opinion question answering is a challenging task for natural language processing. in this paper, we discuss a necessary component for an opinion question answering system: separating opinions from fact, at both the document and sentence level. we present a bayesian classifier for discriminating between documents with a preponderance of opinions such as editorials from regular news stories, and describe three unsupervised, statistical techniques for the significantly harder task of detecting opinions at the sentence level. we also present a first model for classifying opinion sentences as positive or negative in terms of the main perspective being expressed in the opinion. results from a large collection of news stories and a human evaluation of 400 sentences are reported, indicating that we achieve very high performance in document classification (upwards of 97% precision and recall), and respectable performance in detecting opinions and classifying them at the sentence level as positive, negative, or neutral (up to 91% accuracy). surrounding text:opinion mining is supposed to have many potential applications. for web search, it can be integrated into search engines to satisfy users' searching goals on opinion, such as comparative web search (cws) [15]<3> and opinion question answering [***]<3>. the task of opinion mining has usually been approached as a classification of either positive or negative on a review or its snippet influence:2 type:3 pair index:625 citer id:538 citer title:exploiting hidden sentiment links in reviews for citer abstract:in the past few years, much research effort has been devoted to opinion mining on product reviews. recently, some research effort has been devoted to feature-level opinion mining instead of the whole-review level, by identifying the semantic relatedness between the product feature and the opinion words in its context. these methods usually utilize the explicit relatedness between product feature words and opinion words in reviews. however, the approach based on explicit adjacency may lose some hidden sentiment associations between product features and opinion words. in this paper, we propose a novel unsupervised mutual reinforcement approach to deal with the feature-level product opinion mining problem. more specifically, 1) the approach clusters product features and opinion words simultaneously and iteratively by fusing both their content information and sentiment link information. 2) under the same framework, based on the product feature categories and opinion word groups, we construct the sentiment association set between the two groups of data objects by identifying their strongest n sentiment links. moreover, knowledge from multiple sources is incorporated to enhance clustering in the procedure. based on the pre-constructed association set, our approach can largely predict opinions relating to different product features, even for the case without explicit appearance of product feature words in reviews. thus it provides a more accurate opinion evaluation.
the experimental results demonstrate that our method outperforms state-of-the-art algorithms citee id:192 citee title:a unified framework for clustering heterogeneous web objects citee abstract:in this paper, we introduce a novel framework for clustering web data which is often heterogeneous in nature. as most existing methods often integrate heterogeneous data into a unified feature space, their flexibility to explore and adjust the contributing effect from different heterogeneous information is compromised. in contrast, our framework enables separate clustering of homogeneous data in the entire process based on their respective features, and a layered structure with link information surrounding text:in the procedure, a basic clustering algorithm is needed to cluster objects in each layer based on the defined similarity function (equation 1). in the first step of iterative reinforcement, we cluster data objects only by their intra-relationship without interrelated link information, since in most cases link information is too sparse in the beginning to help the clustering [***]<3>. then both intra- and inter-relationships are combined in the subsequent steps to iteratively enhance reinforcement influence:2 type:1 pair index:626 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:145 citee title:a new learning algorithm for blind source separation citee abstract:a new on-line learning algorithm which minimizes a statistical dependency among outputs is derived for blind separation of mixed signals. the dependency is measured by the average mutual information (mi) of the outputs. the source signals and the mixing matrix are unknown except for the number of the sources. the gram-charlier expansion instead of the edgeworth expansion is used in evaluating the mi. the natural gradient approach is used to minimize the mi. a novel activation function is surrounding text:we use here the new approximations developed in [19]<1>, based on the maximum entropy principle. in [19]<1> it was shown that these approximations are often considerably more accurate than the conventional, cumulant-based approximations in [7, ***, 26]<2>.
in the simplest case, these new approximations are of the form: j(yi) ≈ c [e{g(yi)} - e{g(ν)}]^2, where ν is a standardized gaussian variable and c is a constant influence:3 type:1 pair index:627 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:225 citee title:an information-maximization approach to blind separation and blind deconvolution citee abstract:we derive a new self-organizing learning algorithm that maximizes the information transferred in a network of nonlinear units. the algorithm does not assume any knowledge of the input distributions, and is defined here for the zero-noise limit. under these conditions, information maximization has extra properties not found in the linear case (linsker 1989). the nonlinearities in the transfer function are able to pick up higher-order moments of the input distributions and perform something akin to true redundancy reduction between units in the output representation. this enables the network to separate statistically independent components in the inputs: a higher-order generalization of principal components analysis. we apply the network to the source separation (or cocktail party) problem, successfully separating unknown mixtures of up to 10 speakers. we also show that a variant on the network architecture is able to perform blind deconvolution (cancellation of unknown echoes and reverberation in a speech signal). finally, we derive dependencies of information transfer on time delays. we suggest that information maximization provides a unifying framework for problems in "blind" signal processing surrounding text:for simplicity, one can choose gopt(u) = log fi(u). thus the optimal contrast function is the same as the one obtained by the maximum likelihood approach [34]<1>, or the infomax approach [***]<1>. almost identical results have also been obtained in [5]<2> for another algorithm. if g(u) grows fast with u, the estimator becomes highly non-robust against outliers. taking also into account the fact that most independent components encountered in practice are super-gaussian [***, 25]<1>, one reaches the conclusion that, as a general-purpose contrast function, one should choose a function g that resembles rather gopt(u) = |u|^α, where α < 2 (13).
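the approximation of negentropy quoted at the start of this record, j(y) ≈ c [e{g(y)} - e{g(ν)}]^2 with ν a standardized gaussian variable, can be evaluated numerically as in the sketch below; the choice g(u) = log cosh(u) and the monte-carlo gaussian reference are assumptions made only for the illustration.

import numpy as np

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u)), c=1.0, n_ref=100_000, seed=0):
    """one-unit negentropy approximation: c * (E{G(y)} - E{G(nu)})**2,
    with y standardized and nu a standard gaussian reference sample."""
    rng = np.random.default_rng(seed)
    y = (y - y.mean()) / y.std()        # enforce zero mean, unit variance
    nu = rng.standard_normal(n_ref)     # standardized gaussian reference
    return c * (G(y).mean() - G(nu).mean()) ** 2

# a uniform variable is non-gaussian, so the value should be clearly above zero
rng = np.random.default_rng(1)
print(negentropy_approx(rng.uniform(-1, 1, 50_000)))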
the problem with such contrast functions is, however, that they are not differentiable at 0 for α ≤ 1 influence:3 type:1 pair index:628 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:511 citee title:equivariant adaptive source separation citee abstract:source separation consists of recovering a set of independent signals when only mixtures with unknown coefficients are observed. this paper introduces a class of adaptive algorithms for source separation that implements an adaptive version of equivariant estimation and is henceforth called equivariant adaptive separation via independence (easi). the easi algorithms are based on the idea of serial updating. this specific form of matrix updates systematically yields algorithms with a simple structure for both real and complex mixtures. most importantly, the performance of an easi algorithm does not depend on the mixing matrix. in particular, convergence rates, stability conditions, and interference rejection levels depend only on the (normalized) distributions of the source signals. closed-form expressions of these quantities are given via an asymptotic performance analysis. the theme of equivariance is stressed throughout the paper. the source separation problem has an underlying multiplicative structure. the parameter space forms a (matrix) multiplicative group. we explore the (favorable) consequences of this fact on implementation, performance, and optimization of easi algorithms surrounding text:thus the optimal contrast function is the same as the one obtained by the maximum likelihood approach [34]<1>, or the infomax approach [3]<1>. almost identical results have also been obtained in [***]<2> for another algorithm. the theorem above treats, however, the one-unit case instead of the multi-unit case treated by the other authors influence:3 type:1 pair index:629 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach.
using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:552 citee title:independent component analysis, a new concept? signal processing citee abstract:when applying unsupervised learning techniques like ica or temporal decorrelation for bss, a key question is whether the discovered projections are reliable. in other words: can we give error bars or can we assess the quality of our separation? we use resampling methods to tackle these questions and show experimentally that our proposed variance estimations are strongly correlated to the separation error. we demonstrate that this reliability estimation can be used to choose an appropriate surrounding text:the transformation may be defined using such criteria as optimal dimension reduction, statistical interestingness of the resulting components si, simplicity of the transformation, or other criteria, including application-oriented ones. we treat in this paper the problem of estimating the transformation given by (linear) independent component analysis (ica) [***, 27]<1>. as the name implies, the basic goal in determining the transformation is to find components that are statistically as independent from each other as possible. non-gaussianity of the independent components is necessary for the identifiability of the model (2), see [***]<1>. comon [***]<1> showed how to obtain a more general formulation for ica that does not need to assume an underlying data model. identifiability of the model (2), see [***]<1>. comon [***]<1> showed how to obtain a more general formulation for ica that does not need to assume an underlying data model. negentropy is defined as j(y) = h(ygauss) - h(y) (4), where ygauss is a gaussian random vector of the same covariance matrix as y. negentropy can also be interpreted as a measure of nongaussianity [***]<1>. using the concept of differential entropy, one can define the mutual information i between the n (scalar) random variables yi, i = 1, ..., n [8, ***]<1>. mutual information is a natural measure of the dependence between random variables. it is particularly interesting to express mutual information using negentropy, constraining the variables to be uncorrelated. in this case, we have [***]<1> i(y1, y2, ..., yn) = j(y) - sum_i j(yi). thus we define in this paper, following [***]<1>, the ica of a random vector x as an invertible transformation s = wx as in (1), where the matrix w is determined so that the mutual information of the transformed components si is minimized. note that mutual information (or the independence of the components) is not affected by multiplication of the components by scalar constants. this simplifies the computations considerably. because negentropy is invariant for invertible linear transformations [***]<1>, it is now obvious from (5) that finding an invertible transformation w that minimizes the mutual information is roughly equivalent to finding directions in which the negentropy is maximized. the random variable yi is assumed to be of zero mean and unit variance. for symmetric variables, this is a generalization of the cumulant-based approximation in [***]<1>, which is obtained by taking g(yi) = yi^4.
the choice of the function g is deferred to section 3. indeed, it can be accomplished by classical pca. for details, see [***, 12]<1>. such simplifications presuppose, of course, that the covariance matrix is not singular. if it is singular or near-singular, the dimension of the data must be reduced, for example with pca [***, 28]<1>. in practice, the expectations in the contrast functions are estimated from sample averages. (figure: log10 of estimation error) 6 conclusions: the problem of linear independent component analysis (ica), which is a form of redundancy reduction, was addressed. following comon [***]<1>, the ica problem was formulated as the search for a linear transformation that minimizes the mutual information of the resulting components. this is roughly equivalent to maximizing the negentropy (nongaussianity) of the components. the novel approximations of negentropy introduced in [19]<2> were then used for constructing novel contrast (objective) functions for ica. this resulted in a generalization of the kurtosis-based approach in [***, 9]<1>, and also enabled estimation of the independent components one by one. the statistical properties of these contrast functions were analyzed in the framework of the linear mixture model, and it was shown that for suitable choices of the contrast functions, the statistical properties were superior to those of the kurtosis-based approach influence:1 type:1 pair index:630 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:18 citee title:elements of information theory citee abstract:all the essential topics in information theory are covered in detail, including entropy, data compression, channel capacity, rate distortion, network information theory, and hypothesis testing. the authors provide readers with a solid understanding of the underlying theory and applications. problem sets and a telegraphic summary at the end of each chapter further assist readers. the historical notes that follow each chapter recap the main points. surrounding text:one can define the mutual information i between the n (scalar) random variables yi, i = 1, ..., n [***, 7]<1>. mutual information is a natural measure of the dependence between random variables influence:3 type:1 pair index:631 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible.
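the whitening / pca preprocessing referred to in the surrounding text above (decorrelation by classical pca, with dimension reduction when the covariance matrix is near-singular) can be sketched as follows; the eigenvalue threshold eps is an assumed parameter of the illustration.

import numpy as np

def whiten(x, eps=1e-10):
    """x: (n_samples, n_dims). returns whitened data (unit covariance) and the
    whitening matrix; near-singular directions are dropped."""
    xc = x - x.mean(axis=0)
    cov = np.cov(xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    keep = eigvals > eps                                  # dimension reduction if needed
    w = np.diag(1.0 / np.sqrt(eigvals[keep])) @ eigvecs[:, keep].T
    return xc @ w.T, w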
in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:156 citee title:a projection pursuit algorithm for exploratory data analysis citee abstract:an algorithm for the analysis of multivariate data is presented and is discussed in terms of specific examples. the algorithm seeks to find one- and two-dimensional linear projections of multivariate data that are relatively highly revealing. surrounding text:a matlab(tm) implementation of the fixed-point algorithm is available on the world wide web free of charge [***]<3>. influence:3 type:3 pair index:632 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:544 citee title:exploratory projection pursuit citee abstract:exploratory projection pursuit (epp) is a tool for an exploratory data analysis. the theorem of cramér-wold allows us to concentrate on (bivariate) projections of the data. the work of diaconis & freedman (1984) emphasizes the search for projections with a non-normal density. index functions are used to describe the amount of structure of a projection. a (local) maximum of the index function provides interesting 2-dimensional views of the data. a variety of index functions have been... surrounding text:several principles and methods have been developed to find such a linear representation, including principal component analysis [30]<1>, factor analysis [15]<1>, projection pursuit [12, 16]<1>, independent component analysis [27]<1>, etc. the transformation may be defined using such criteria as optimal dimension reduction, statistical interestingness of the resulting components, simplicity of the transformation, or other criteria. indeed, it can be accomplished by classical pca. for details, see [7, ***]<1>.
influence:3 type:1 pair index:633 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:535 citee title:experimental comparison of neural ica algorithms citee abstract:several neural algorithms for independent component analysis (ica) have been introduced lately, but their computational properties have not yet been systematically studied. in this paper, we compare the accuracy, convergence speed, computational load, and other properties of five prominent neural or semi-neural ica algorithms. the comparison reveals some interesting differences between the algorithms. 1 introduction: independent component analysis (ica) is an unsupervised technique which surrounding text:fixed-point algorithm. in fact, a comparison of our algorithm with other algorithms was performed in [***]<2>, showing that the fixed-point algorithm gives approximately the same statistical efficiency influence:1 type:2 pair index:634 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:553 citee title:robust statistics citee abstract:in statistics, classical methods rely heavily on assumptions which are often not met in practice.
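the comparison quoted above concerns the statistical efficiency of the fixed-point algorithm; for concreteness, a one-unit fixed-point iteration of the kind associated with this family of methods can be sketched on whitened data as below. the update w <- E{z g(w'z)} - E{g'(w'z)} w with a tanh nonlinearity is the commonly cited form and is stated here as an assumption, since the excerpt itself does not reproduce it.

import numpy as np

def one_unit_fixed_point(z, max_iter=200, tol=1e-6, seed=0):
    """z: whitened data, shape (n_samples, n_dims). returns one projection vector w."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        proj = z @ w                                     # w^T z for every sample
        g, g_prime = np.tanh(proj), 1.0 - np.tanh(proj) ** 2
        w_new = (z * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        # w and -w define the same component, so test convergence for both signs
        if min(np.linalg.norm(w_new - w), np.linalg.norm(w_new + w)) < tol:
            return w_new
        w = w_new
    return w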
in particular, it is often assumed that the data are normally distributed, at least approximately, or that the central limit theorem can be relied on to produce normally distributed estimates. unfortunately, when there are outliers in the data, classical methods often have very poor performance. robust statistics seeks to provide methods that emulate classical methods, but which are not unduly affected by outliers or other small departures from model assumptions. surrounding text:robustness: another very attractive property of an estimator is robustness against outliers [***]<2>. this means that single, highly erroneous observations do not have much influence on the estimate. to obtain a simple form of robustness called b-robustness, we would like the estimator to have a bounded influence function [***]<2>. again, we can adapt the results in [18]<1> influence:3 type:3 pair index:635 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:554 citee title:modern factor analysis citee abstract:this thoroughly revised edition of harry h. harman's authoritative text incorporates the many new advances made in computer science and technology over the last ten years. the author gives full coverage to both theoretical and applied aspects of factor analysis from its foundations through the most advanced techniques. this highly readable text will be welcomed by researchers and students working in psychology, statistics, economics, and related disciplines. surrounding text:several principles and methods have been developed to find such a linear representation, including principal component analysis [30]<1>, factor analysis [***]<1>, projection pursuit [12, 16]<1>, independent component analysis [27]<1>, etc. the transformation may be defined using such criteria as optimal dimension reduction, statistical interestingness of the resulting components, simplicity of the transformation, or other criteria influence:3 type:1 pair index:636 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica.
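the robustness argument in the surrounding text above (single erroneous observations should not dominate the estimate) can be illustrated numerically: a fourth-moment, kurtosis-style statistic reacts violently to one outlier, while a slowly growing contrast such as log cosh barely moves. the specific numbers below are only an illustration, not results from the paper.

import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(10_000)
y_outlier = np.append(y, 100.0)                     # one gross outlier

kurtosis_stat = lambda v: np.mean(v ** 4) - 3.0     # grows with the 4th power
logcosh_stat = lambda v: np.mean(np.log(np.cosh(v)))

print(kurtosis_stat(y), kurtosis_stat(y_outlier))   # explodes with the outlier
print(logcosh_stat(y), logcosh_stat(y_outlier))     # changes only slightly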
these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:555 citee title:projection pursuit citee abstract:this article only concerns the automatic pp, as it is the pp that is used today. clearly, non-linear structures such as clusters, separations and unexpected shapes are interesting. in general, the phrase "non-linear structures" refers to those structures which cannot be detected easily by using a sample mean and a sample covariance matrix. the summary statistics like the sample mean and covariance reveal linear structures such as the location, scale and correlation structures of a data set. surrounding text:several principles and methods have been developed to find such a linear representation, including principal component analysis [30]<1>, factor analysis [15]<1>, projection pursuit [12, ***]<1>, independent component analysis [27]<1>, etc. the transformation may be defined using such criteria as optimal dimension reduction, statistical interestingness of the resulting components, simplicity of the transformation, or other criteria influence:3 type:1 pair index:637 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:556 citee title:one-unit contrast functions for independent component analysis: a statistical analysis citee abstract:the author introduced previously a large family of one-unit contrast functions to be used in independent component analysis (ica). in this paper, the family is analyzed mathematically in the case of a finite sample. two aspects of the estimators obtained using such contrast functions are considered: asymptotic variance, and robustness against outliers. an expression for the contrast function that minimizes the asymptotic variance is obtained as a function of the probability densities of the... surrounding text:comparison of, say, the traces of the asymptotic covariance matrices of two estimators enables direct comparison of the mean-square error of the estimators.
in [***]<1>, evaluation of asymptotic variances was addressed using a related family of contrast functions. in fact, it can be seen that the results in [***]<1> are valid even in this case, and thus we have the following theorem: theorem 2 the trace of the asymptotic (co)variance of w. in [***]<1>, evaluation of asymptotic variances was addressed using a related family of contrast functions. in fact, it can be seen that the results in [***]<1> are valid even in this case, and thus we have the following theorem: theorem 2 the trace of the asymptotic (co)variance of w is minimized when g is of the form g_opt(u) = k1 log f_i(u) + k2 u^2 + k3 (10), where f_i(.) is the density function of si. influence function [14]<2>. again, we can adapt the results in [***]<1>. it turns out to be impossible to have a completely bounded in influence:1 type:2 pair index:638 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:557 citee title:new approximations of differential entropy for independent component analysis and projection pursuit citee abstract:we derive a first-order approximation of the density of maximum entropy for a continuous 1-d random variable, given a number of simple constraints. this results in a density expansion which is somewhat similar to the classical polynomial density expansions by gram-charlier and edgeworth. using this approximation of density, an approximation of 1-d differential entropy is derived. the approximation of entropy is both more exact and more robust against outliers than the classical approximation... surrounding text:definition of ica given above, a simple estimate of the negentropy (or of differential entropy) is needed. we use here the new approximations developed in [***]<1>, based on the maximum entropy principle. in [***]<1> it was shown that these approximations are often considerably more accurate than the conventional, cumulant-based approximations in [7, 1, 26]<2>. we use here the new approximations developed in [***]<1>, based on the maximum entropy principle. in [***]<1> it was shown that these approximations are often considerably more accurate than the conventional, cumulant-based approximations in [7, 1, 26]<2>.
in the simplest case, these new approximations are of the form: j(y_i) ≈ c [e{g(y_i)} - e{g(v)}]^2, where v is a standardized gaussian variable influence:1 type:1 pair index:639 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:51 citee title:a fast algorithm for estimating overcomplete ica bases for image windows citee abstract:we introduce a very fast method for estimating overcomplete bases of independent components from image data. this is based on the concept of quasi-orthogonality, which means that in a very high-dimensional space, there can be a large, overcomplete set of vectors that are almost orthogonal to each other. thus we may estimate an overcomplete basis by using one-unit ica algorithms and forcing only partial decorrelation between the different independent components. the method can be implemented using a modification of the fastica algorithm, which leads to a computationally highly efficient method. surrounding text:simulations as well as applications on real-life data have validated the novel contrast functions and algorithms introduced. some extensions of the methods introduced in this paper are presented in [20]<1>, in which the problem of noisy data is addressed, and in [***]<1>, which deals with the situation where there are more independent components than observed variables. influence:3 type:2 pair index:640 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance.
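the negentropy approximation reconstructed above, j(y) ≈ c [e{g(y)} - e{g(v)}]^2 with v a standardized gaussian variable, can be evaluated directly from a zero-mean, unit-variance sample; the following python sketch is illustrative only, assuming the common contrast choice g(u) = (1/a) log cosh(a u) (the function names are made up here, not code from the cited papers).

    # illustrative sketch of j(y) ~ c * (E[G(y)] - E[G(v)])^2 with G(u) = log(cosh(a*u))/a
    import numpy as np

    def negentropy_approx(y, a=1.0, c=1.0, n_gauss=100000, seed=0):
        """approximate negentropy of a sample y (standardized internally)."""
        rng = np.random.default_rng(seed)
        y = (y - y.mean()) / y.std()                      # zero mean, unit variance
        G = lambda u: np.log(np.cosh(a * u)) / a
        Eg_y = G(y).mean()                                # E{G(y)} from the sample
        Eg_v = G(rng.standard_normal(n_gauss)).mean()     # E{G(v)} by monte carlo
        return c * (Eg_y - Eg_v) ** 2

    rng = np.random.default_rng(1)
    print(negentropy_approx(rng.laplace(size=50000)))       # super-gaussian: clearly > 0
    print(negentropy_approx(rng.standard_normal(50000)))    # gaussian: close to 0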
finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:558 citee title:independent component analysis by general nonlinear hebbian-like learning rules citee abstract:a number of neural learning rules have been recently proposed... in this paper, we show that in fact, ica can be performed by very simple hebbian or anti-hebbian learning rules, which may have only weak relations to such information-theoretical quantities. rather surprisingly, practically any non-linear function can be used in the learning rule, provided only that the sign of the hebbian/anti-hebbian term is chosen correctly. in addition to the hebbian-like mechanism, the weight vector is here surrounding text:in fact, finding a single direction that maximizes negentropy is a form of projection pursuit, and could also be interpreted as estimation of a single independent component [***]<1>. 2. ), and v is a standardized gaussian variable. this theorem can be considered a corollary of the theorem in [***]<1>. the condition in theorem 1 seems to be true for most reasonable choices of g, and distributions of the si. the constraint could be taken into account by a bigradient feedback. this leads to neural (adaptive) algorithms that are closely related to those introduced in [***]<1>. we show in the appendix b how to modify the algorithms in [***]<1> to minimize the contrast functions used in this paper. this leads to neural (adaptive) algorithms that are closely related to those introduced in [***]<1>. we show in the appendix b how to modify the algorithms in [***]<1> to minimize the contrast functions used in this paper. the advantage of neural on-line learning rules is that the inputs x(t) can be used in the algorithm at once, thus enabling faster adaptation in a non-stationary environment. if the convergence is not satisfactory, one may then increase the sample size. a reduction of the step size in the stabilized version has a similar effect, as is well-known in stochastic approximation methods [***, 28]<1>. 4 influence:1 type:1 pair index:641 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions.
these algorithms optimize the contrast functions very fast and reliably citee id:284 citee title:blind separation of sources citee abstract:the blind source separation problem is to extract the underlying source signals from a set of linear mixtures, where the mixing matrix is unknown. this situation is common in acoustics, radio, medical signal and image processing, hyperspectral imaging, and other areas. we suggest a two-stage separation process: a priori selection of a possibly overcomplete signal dictionary (for instance, a wavelet frame or a learned dictionary) in which the sources are assumed to be sparsely representable, followed by unmixing the sources by exploiting their sparse representability. we consider the general case of more sources than mixtures, but also derive a more efficient algorithm in the case of a nonovercomplete dictionary and an equal number of sources and mixtures. experiments with artificial signals and musical sounds demonstrate significantly better separation than other known techniques. surrounding text:several principles and methods have been developed to find such a linear representation, including principal component analysis [30]<1>, factor analysis [15]<1>, projection pursuit [12, 16]<1>, independent component analysis [***]<1>, etc. the transformation may be defined using such criteria as optimal dimension reduction, statistical interestingness of the resulting components si, simplicity of the transformation, or other criteria, including application-oriented ones. we treat in this paper the problem of estimating the transformation given by (linear) independent component analysis (ica) [7, ***]<1>. as the name implies, the basic goal in determining the transformation is to. two promising applications of ica are blind source separation and feature extraction. in blind source separation [***]<1>, the observed values of x correspond to a realization of an m-dimensional discrete-time signal x(t), t = 1,2,. influence:1 type:1 pair index:642 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comon's information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:9 citee title:a class of neural networks for independent component analysis citee abstract:independent component analysis (ica) is a recently developed, useful extension of standard principal component analysis (pca).
the ica model is utilized mainly in blind separation of unknown source signals from their linear mixtures. in this application only the source signals which correspond to the coefficients of the ica expansion are of interest. in this paper, we propose neural structures related to multilayer feedforward networks for performing complete ica. the basic ica network consists of whitening, separation, and basis vector estimation layers. it can be used for both blind source separation and estimation of the basis vectors of ica. we consider learning algorithms for each layer, and modify our previous nonlinear pca type algorithms so that their separation capabilities are greatly improved. the proposed class of networks yields good results in test examples with both artificial and real-world data surrounding text:cations presuppose, of course, that the covariance matrix is not singular. if it is singular or near-singular, the dimension of the data must be reduced, for example with pca [7, ***]<1>. in practice, the expectations in the. if the convergence is not satisfactory, one may then increase the sample size. a reduction of the step size in the stabilized version has a similar effect, as is well-known in stochastic approximation methods [24, ***]<1>. 4 influence:2 type:2 pair index:643 citer id:551 citer title:fast and robust fixed-point algorithms for independent component analysis citer abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimen-sional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comons information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential en-tropy, we introduce a family of new contrast (objective) functions for ica. these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of indi-vidual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we intro-duce simple .xed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably citee id:548 citee title:extraction of ocular artifacts from eeg using independent component analysis citee abstract:laboratory of computer and information science, helsinki university of technology, hut, finland. ricardo.vigario@hut.fi eye activity is one of the main sources of artefacts in eeg and meg recordings. a new approach to the correction of these disturbances is presented using the statistical technique of independent component analysis. this technique separates components by the kurtosis of their amplitude distribution over time, thereby distinguishing between strictly periodical signals, regularly occurring signals and irregularly occurring signals. the latter category is usually formed by artefacts. through this approach, it is possible to isolate pure eye activity in the eeg recordings (including eog channels), and so reduce the amount of brain activity that is subtracted from the measurements, when extracting portions of the eog signals. 
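the fixed-point optimization and the pca whitening mentioned in the surrounding text above can be illustrated with a small one-unit iteration on whitened mixtures; this is a minimal sketch assuming the usual update w <- e{z g(w'z)} - e{g'(w'z)} w with g = tanh, and is not the authors' implementation.

    # sketch: mix two sources, whiten with pca, run a one-unit fixed-point iteration
    import numpy as np

    rng = np.random.default_rng(0)
    T = 20000
    s = np.vstack([rng.laplace(size=T),                               # super-gaussian source
                   rng.uniform(-np.sqrt(3), np.sqrt(3), size=T)])     # sub-gaussian source
    x = rng.standard_normal((2, 2)) @ s                               # observed linear mixtures

    # whitening via the eigendecomposition of the covariance matrix
    x = x - x.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(x))
    z = E @ np.diag(d ** -0.5) @ E.T @ x                              # cov(z) = identity

    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    for _ in range(100):
        y = w @ z
        w_new = (z * np.tanh(y)).mean(axis=1) - (1 - np.tanh(y) ** 2).mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1) < 1e-9:                            # converged up to sign
            w = w_new
            break
        w = w_new

    # the recovered projection correlates strongly with exactly one source
    print(np.corrcoef(w @ z, s[0])[0, 1], np.corrcoef(w @ z, s[1])[0, 1])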
surrounding text:experiments on different kinds of real life data have also been performed using the contrast functions and algorithms introduced above. these applications include artifact cancellation in eeg and meg [***, 37]<3>, decomposition of evoked . elds in meg [38]<3>, and feature extraction of image data [35, 25]<3> influence:2 type:3 pair index:644 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:that is, it assigns a finite probability to any specific document, and it can generate random documents from the corpus. generative probabilistic models have been applied in ir before [3, ***]<1>, and gap has particular similarities to lda (latent dirichlet allocation) [***, 2]<1>. we will say more about their relationship during the derivation of the gap model in the next section. that is, it assigns a finite probability to any specific document, and it can generate random documents from the corpus. generative probabilistic models have been applied in ir before [3, ***]<1>, and gap has particular similarities to lda (latent dirichlet allocation) [***, 2]<1>. we will say more about their relationship during the derivation of the gap model in the next section. cran, the cranfield dataset has 1400 documents and 3763 distinct terms. in [***]<1>, the corpus was divided randomly into training and test sets. 
we followed this approach, and our training set comprised 1300 documents while the test set comprised the remaining 100 influence:1 type:1 pair index:645 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:that is, it assigns a finite probability to any specific document, and it can generate random documents from the corpus. generative probabilistic models have been applied in ir before [3, 1]<1>, and gap has particular similarities to lda (latent dirichlet allocation) [1, ***]<1>. we will say more about their relationship during the derivation of the gap model in the next section influence:1 type:1 pair index:646 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. 
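perplexity, the held-out fit measure referred to in the citer abstract, is the exponential of the negative average log-likelihood per token on a test split such as the cranfield one above; a minimal python sketch with a toy unigram model (the add-one smoothing here is an assumption made for simplicity, not the paper's estimator).

    # sketch: perplexity of held-out text under a smoothed unigram model
    import numpy as np
    from collections import Counter

    train = ["the cat sat on the mat".split(), "the dog sat".split()]
    held_out = ["the cat sat".split(), "a dog on the mat".split()]

    counts = Counter(w for doc in train for w in doc)
    vocab = set(counts) | {w for doc in held_out for w in doc}
    total = sum(counts.values())
    p = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}   # add-one smoothing

    log_lik = sum(np.log(p[w]) for doc in held_out for w in doc)
    n_tokens = sum(len(doc) for doc in held_out)
    print("perplexity:", np.exp(-log_lik / n_tokens))                # lower = better fit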
it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:427 citee title:probabilistic latent semantic indexing citee abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain{specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with di erent dimensionalities has proven to be advantageous surrounding text:that is, it assigns a finite probability to any specific document, and it can generate random documents from the corpus. generative probabilistic models have been applied in ir before [***, 1]<1>, and gap has particular similarities to lda (latent dirichlet allocation) [1, 2]<1>. we will say more about their relationship during the derivation of the gap model in the next section influence:2 type:1 pair index:647 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:551 citee title:fast and robust fixed-point algorithms for independent component analysis citee abstract:independent component analysis (ica) is a statistical method for transforming an observed multidimen-sional random vector into components that are statistically as independent from each other as possible. in this paper, we use a combination of two different approaches for linear ica: comons information-theoretic approach and the projection pursuit approach. using maximum entropy approximations of differential en-tropy, we introduce a family of new contrast (objective) functions for ica. 
these contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of indi-vidual independent components as projection pursuit directions. the statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. finally, we intro-duce simple .xed-point algorithms for practical optimization of the contrast functions. these algorithms optimize the contrast functions very fast and reliably surrounding text:we will say more about their relationship during the derivation of the gap model in the next section. gap is also related to ica (independent component analysis) algorithms [***]<2> which also attempt to maximize the statistical independence of factors in a mixture distribution. there are two differences with ica and gap influence:2 type:2 pair index:648 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:577 citee title:independent component analysis: algorithms and applications citee abstract:a fundamental problem in neural network research, as well as in many other disciplines, is finding a suitable representation of multivariate data, i.e. random vectors. for reasons of computational and conceptual simplicity, the representation is often sought as a linear transformation of the original data. in other words, each component of the representation is a linear combination of the original variables. well-known linear transformation methods include principal component analysis, factor analysis, and projection pursuit. independent component analysis (ica) is a recently developed method in which the goal is to find a linear representation of non-gaussian data so that the components are statistically independent, or as independent as possible. such a representation seems to capture the essential structure of the data in many applications, including feature extraction and signal separation. in this paper, we present the basic theory and applications of ica, and our recent work on the subject surrounding text:so the likelihood maximization will favor the most independent components. this is very similar to the popular technique of component analysis by kurtosis minimization from speech analysis [***]<2>, which has also been applied to ica. 
kurtosis is a general measure of skewedness of a distribution, and increases when signals are mixed influence:2 type:2 pair index:649 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:179 citee title:a solution to platos problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge citee abstract:how do people know as much as they do with as little information as they get? the problem takes many forms; learning vocabulary from text is an especially dramatic and convenient case for research. a new general theory of acquired similarity and knowledge representation, latent semantic analysis (lsa), is presented and used to successfully simulate such learning and several other psycholinguistic phenomena. by inducing global knowledge indirectly from local co-occurrence data in a large body of representative text, lsa acquired knowledge about the full vocabulary of english at a comparable rate to school-children. lsa uses no prior linguistic or perceptual similarity knowledge; it is based solely on a general mathematical learning method that achieves powerful inductive effects by extracting the right number of dimensions (e.g., 300) to represent objects and contexts. relations to other theories, phenomena, and problems are sketched surrounding text:2. related work because it computes an approximate factorization x of the document-term matrix, gap is related to latent semantic analysis [***, 7, 8]<1>. it can be used for document retrieval (in either the low-dimensional theme space, or as a smoothing algorithm in term space) influence:2 type:1 pair index:650 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. 
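the kurtosis remark above can be checked numerically: linearly mixing independent non-gaussian sources makes each observed signal more nearly gaussian, so its excess kurtosis moves toward the gaussian value of 0; a small illustrative sketch with synthetic laplacian sources (toy data, not from the papers).

    # sketch: excess kurtosis of sources versus their random linear mixtures
    import numpy as np

    def excess_kurtosis(v):
        v = v - v.mean()
        return (v ** 4).mean() / (v ** 2).mean() ** 2 - 3.0   # 0 for a gaussian

    rng = np.random.default_rng(0)
    s = rng.laplace(size=(4, 100000))        # independent super-gaussian sources
    x = rng.standard_normal((4, 4)) @ s      # random linear mixtures

    print("sources :", [round(excess_kurtosis(si), 2) for si in s])   # roughly 3 each
    print("mixtures:", [round(excess_kurtosis(xi), 2) for xi in x])   # closer to 0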
so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:578 citee title:introduction to latent semantic analysis citee abstract:latent semantic analysis (lsa) is a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text (landauer and dumais, 1997). the underlying idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other. the adequacy of lsa's reflection of human knowledge has been established in a variety of ways. for example, its scores overlap those of humans on standard vocabulary and subject matter tests; it mimics human word sorting and category judgments; it simulates word-word and passage-word lexical priming data; and, as reported in 3 following articles in this issue, it accurately estimates passage coherence, learnability of passages by individual students, and the quality and quantity of knowledge contained in an essay surrounding text:2. related work. because it computes an approximate factorization x of the document-term matrix, gap is related to latent semantic analysis [6, ***, 8]<1>. it can be used for document retrieval (in either the low-dimensional theme space, or as a smoothing algorithm in term space) influence:2 type:1 pair index:651 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:579 citee title:learning human-like knowledge by singular value decomposition: a progress report citee abstract:singular value decomposition (svd) can be viewed as a method for unsupervised training of a network that associates two classes of events reciprocally by linear connections through a single hidden layer. svd was used to learn and represent relations among very large numbers of words (20k-60k) and very large numbers of natural text passages (1k-70k) in which they occurred.
the result was 100-350 dimensional "semantic spaces" in which any trained or newly added word or passage could be represented as a vector, and similarities were measured by the cosine of the contained angle between vectors. good accuracy in simulating human judgments and behaviors has been demonstrated by performance on multiple-choice vocabulary and domain knowledge tests, emulation of expert essay evaluations, and in several other ways. examples are also given of how the kind of knowledge extracted by this method can be applied. surrounding text:2. related work. because it computes an approximate factorization x of the document-term matrix, gap is related to latent semantic analysis [6, 7, ***]<1>. it can be used for document retrieval (in either the low-dimensional theme space, or as a smoothing algorithm in term space) influence:2 type:1 pair index:652 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:204 citee title:algorithms for non-negative matrix factorization citee abstract:non-negative matrix factorization (nmf) has previously been shown to be a useful decomposition for multivariate data. two different multiplicative algorithms for nmf are analyzed. they differ only slightly in the multiplicative factor used in the update rules. one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized kullback-leibler divergence. the monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the expectation-maximization algorithm. the algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence surrounding text:5 recurrence formulae. at this point, we observe that the equation for the loglikelihood (3) is similar to an entropy or kl-divergence measure. in [***]<2>, lee and seung considered a similar measure called divergence for non-negative matrix factorization. the measure they actually considered was a divergence summed over j = 1, ..., n influence:3 type:2 pair index:653 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable.
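the divergence measure attributed to lee and seung above is commonly minimized with their multiplicative update rules; the following python sketch assumes the standard form of those updates for d(v||wh) = sum_ij ( v_ij log(v_ij/(wh)_ij) - v_ij + (wh)_ij ), and is illustrative of nmf only, not the gap recurrence itself.

    # sketch: multiplicative nmf updates for the generalized kl divergence
    import numpy as np

    def nmf_divergence(V, k, n_iter=200, eps=1e-9, seed=0):
        rng = np.random.default_rng(seed)
        n, m = V.shape
        W = rng.random((n, k)) + eps
        H = rng.random((k, m)) + eps
        for _ in range(n_iter):
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / W.sum(axis=0)[:, None]    # update H, columns of W sum in denominator
            WH = W @ H + eps
            W *= ((V / WH) @ H.T) / H.sum(axis=1)[None, :]    # update W, rows of H sum in denominator
        return W, H

    V = np.random.default_rng(1).poisson(3.0, size=(20, 30)).astype(float)
    W, H = nmf_divergence(V, k=5)
    print("reconstruction error (frobenius):", np.linalg.norm(V - W @ H))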
gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:446 citee title:document clustering based on non-negative matrix factorization citee abstract:in this paper, we propose a novel document clustering method based on the non-negative factorization of the term-document matrix of the given document corpus. in the latent semantic space derived by the non-negative matrix factorization (nmf), each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. the cluster membership of each document can be easily determined by finding the base topic (the axis) with which the document has the largest projection value. our experimental evaluations show that the proposed document clustering method surpasses the latent semantic indexing and the spectral clustering methods not only in the easy and reliable derivation of document clustering results, but also in document clustering accuracies. surrounding text:in fact we show that gap can be used to maximize the independence of components, intuitively a more useful condition than orthogonality. the factorization that gap computes is a nmf (nonnegative matrix factorization [***]<2>) of the document corpus. nmf was shown in [***]<2> to have extremely good performance in trec clustering benchmarks. the factorization that gap computes is a nmf (nonnegative matrix factorization [***]<2>) of the document corpus. nmf was shown in [***]<2> to have extremely good performance in trec clustering benchmarks. note though, that the factorization derived from gap will be different from [***]<2>, which is based on least-squares (implicit spherical gaussian distribution on f). nmf was shown in [***]<2> to have extremely good performance in trec clustering benchmarks. note though, that the factorization derived from gap will be different from [***]<2>, which is based on least-squares (implicit spherical gaussian distribution on f). because gap is based on discrete output variables, it is a generative probabilistic model influence:2 type:2 pair index:654 citer id:576 citer title:gap: a factor model for discrete data citer abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is it gives an approximate factorization of the document-term matrix into a product of matrices and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. 
it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. categories and subject descriptors citee id:184 citee title:a study of smoothing methods for language models applied to ad hoc information retrieval citee abstract:language modeling approaches to information retrieval are attractive and promising because they connect the problem of retrieval with that of language model estimation, which has been studied extensively in other application areas such as speech recognition. the basic idea of these approaches is to estimate a language model for each document, and then rank documents by the likelihood of the query according to the estimated language model. a core problem in language model estimation is smoothing, which adjusts the maximum likelihood estimator so as to correct the inaccuracy due to data sparseness. in this paper, we study the problem of language model smoothing and its influence on retrieval performance. we examine the sensitivity of retrieval performance to the smoothing parameters and compare several popular smoothing methods on different test collections. surrounding text:gap exhibits lower perplexity (better fit) across the dimension scale, especially at low dimensions. in a second experiment, we compared retrieval performance of gap with two standard retrieval methods, tfidf and kl-divergence with dirichlet smoothing on a unigram model [***]<2>. the implementation of both methods came from the lemur toolkit from cmu http://www influence:3 type:3 pair index:655 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic models joint inference improves both the groups and topics discovered. 
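the kl-divergence/dirichlet-smoothing baseline mentioned in the retrieval comparison above scores a document by the query log-likelihood under p(w|d) = (c(w,d) + mu p(w|c)) / (|d| + mu), which ranks documents identically to the kl method up to query-only constants; a toy python sketch follows (the corpus and mu value are made up, and this is not the lemur implementation).

    # sketch: dirichlet-smoothed unigram query-likelihood retrieval
    import math
    from collections import Counter

    docs = {"d1": "the cat sat on the mat".split(),
            "d2": "dogs chase the cat".split(),
            "d3": "stock markets fell sharply".split()}
    mu = 10.0

    coll = Counter(w for d in docs.values() for w in d)
    coll_len = sum(coll.values())
    p_coll = {w: c / coll_len for w, c in coll.items()}

    def score(query, doc):
        tf, dlen = Counter(doc), len(doc)
        return sum(math.log((tf[w] + mu * p_coll.get(w, 1e-9)) / (dlen + mu))
                   for w in query)

    query = "cat mat".split()
    print(sorted(docs, key=lambda d: score(query, docs[d]), reverse=True))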
categories and subject descriptors citee id:202 citee title:agglomerative clustering of a search engine query log citee abstract:this paper introduces a technique for mining a collection of user transactions with an internet search engine to discover clusters of similar queries and similar urls. the information we exploit is "clickthrough data": each record consists of a user's query to a search engine along with the url which the user selected from among the candidates offered by the search engine. by viewing this dataset as a bipartite graph, with the vertices on one side corresponding to queries and on the other side surrounding text:to capture this, we model relations as agreement between entities, not the yes/no vote itself. this kind of content-ignorant feature is similarly found in some work on web log clustering [***]<2>. there has been a considerable amount of previous work in understanding voting patterns [10, 11, 7]<2>, including research on voting cohesion of countries in the eu parliament [11]<2> and partisanship in roll call voting [7]<2> influence:3 type:3 pair index:656 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic models joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:587 citee title:multi-way distributional clustering via pairwise interactions citee abstract:we present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. in this scheme, multiple clustering systems are generated aiming at maximizing an objective function that measures multiple pairwise mutual information between cluster variables. to implement this idea, we propose an algorithm that interleaves top-down clustering of some variables and bottom-up clustering of the other variables, with a local optimization correction routine. focusing on document clustering, we present an extensive empirical study of two-way, three-way and four-way applications of our scheme using six real-world datasets including the 20 news- surrounding text:the central theme of gt is that it simultaneously clusters entities and attributes on relations (words). there has been prior work in clustering different entities simultaneously, such as information theoretic co-clustering [9]<2>, and multi-way distributional clustering using pair-wise interactions [***]<2>.
however, these models do not also cluster attributes based on interactions between entities in a network influence:2 type:2 pair index:657 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic models joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:395 citee title:deduplication and group detection using links citee abstract:categories and subject descriptors surrounding text:social scientists have conducted extensive research on group detection, especially in fields such as anthropology [8]<3> and political science [11, 7]<3>. recently, statisticians and computer scientists have begun to develop models that specifically discover group memberships [15, ***, 17, 13]<2>. one such model is the stochastic blockstructures model [17]<2>, which discovers the latent structure, groups or classes based on pair-wise relation data. relatedwork there has been a surge of interest in models that describe relational data, or relations between entities viewed as links in a network, including recent work in group discovery. one such algorithm, presented by bhattacharya and getoor [***]<2>, is a bottom-up agglomerative clustering algorithm that partitions links in a network into clusters by considering the change in likelihood that would occur if two clusters were merged. once the links have been grouped, the entities connected by the links are assigned to groups. the group detection algorithm (gda) uses a bayesian network to group entities from two datasets, demographic data describing the entities and link data. unlike our model, neither of these models [***, 15]<2> consider attributes associated with the links between the entities. the model presented in [15]<2> considers attributes of an entity rather than attributes of relations between entities influence:1 type:2 pair index:658 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. 
senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic models joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:however, the art model does not explicitly capture groups formed by entities in the network. the gt model simultaneously clusters entities to groups and clusters words into topics, unlike models that generate topics solely based on word distributions such as latent dirichlet allocation [***]<1>. in this way the gt model discovers salient topics relevant to relationships between entities in the social networktopics which the models that only examine words are unable to detect influence:1 type:1 pair index:659 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic models joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:588 citee title:information-theoretic co-clustering citee abstract:two-dimensional contingency or co-occurrence tables arise frequently in important applications such as text, web-log and market-basket data analysisfi a basic problem in contingency table analysis is co-clustering: simultaneous clustering of the rows and columnsfi a novel theoretical formulation views the contingency table as an empirical joint probability distribution of two discrete random variables and poses the co-clustering problem as an optimization problem in information theory -- the surrounding text:the central theme of gt is that it simultaneously clusters entities and attributes on relations (words). 
there has been prior work in clustering different entities simultaneously, such as information theoretic co-clustering [***]<2>, and multi-way distributional clustering using pair-wise interactions [2]<2>. however, these models do not also cluster attributes based on interactions between entities in a network influence:2 type:2 pair index:660 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:589 citee title:how does europe make its mind up? connections, cliques, and compatibility between countries in the eurovision song contest citee abstract:we investigate the complex relationships between countries in the eurovision song contest, by recasting past voting data in terms of a dynamical network. despite the british tendency to feel distant from europe, our analysis shows that the u.k. is remarkably compatible, or 'in tune', with other european countries. equally surprising is our finding that some other core countries, most notably france, are significantly 'out of tune' with the rest of europe. in addition, our analysis enables us to confirm a widely-held belief that there are unofficial cliques of countries -- however these cliques are not always the expected ones, nor can their existence be explained solely on the grounds of geographical proximity. the complexity in this system emerges via the group 'self-assessment' process, and in the absence of any central controller. one might therefore speculate that such complexity is representative of many real-world situations in which groups of 'agents' establish their own inter-relationships and hence ultimately decide their own fate. possible examples include groups of individuals, societies, political groups or even governments. surrounding text:this kind of content-ignorant feature is similarly found in some work on web log clustering [1]<2>. there has been a considerable amount of previous work in understanding voting patterns [***, 11, 7]<2>, including research on voting cohesion of countries in the eu parliament [11]<2> and partisanship in roll call voting [7]<2>.
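a brief sketch of the information-theoretic co-clustering objective named in the record above (standard formulation, assumed rather than quoted): with row and column variables X and Y and their clustered versions \hat{X} and \hat{Y}, the row and column cluster maps are chosen to minimize the loss in mutual information
$$ I(X;Y) - I(\hat{X};\hat{Y}), $$
i.e. the co-clustering that best preserves the statistical dependence between the two modalities is preferred.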
in these models roll call data are used to estimate ideal points of a legislator (which refers to a legislator's preferred policy in the euclidean space of possible policies) influence:3 type:3 pair index:661 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:590 citee title:power to the parties: cohesion and competition in the european parliament citee abstract:how cohesive are political parties in the european parliament? what coalitions form and why? the answers to these questions are central for understanding the impact of the european parliament on european union policies. these questions are also central in the study of legislative behaviour in general. we collected the total population of roll-call votes in the european parliament, from the first elections, in 1979, to the end of 2001 (over 11,500 votes). the data show growing party cohesion despite growing internal national and ideological diversity within the european party groups. we also find that the distance between parties on the left-right dimension is the strongest predictor of coalition patterns. we conclude that increased power of the european parliament has meant increased power for the transnational parties, via increased internal party cohesion and inter-party competition. surrounding text:com. social scientists have conducted extensive research on group detection, especially in fields such as anthropology [8]<3> and political science [***, 7]<3>. recently, statisticians and computer scientists have begun to develop models that specifically discover group memberships [15, 3, 17, 13]<2> influence:3 type:3 pair index:662 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations.
we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:235 citee title:analyzing the us senate in 2003: similarities, networks, clusters and blocs citee abstract:to analyze the roll calls in the us senate in year 2003, we have employed the methods already used throughout the science community for analysis of genes, surveys and text. with information-theoretic measures we assess the association between pairs of senators based on the votes they cast. furthermore, we can evaluate the influence of a voter by postulating a shannon information channel between the outcome and a voter. the matrix of associations can be summarized using hierarchical clustering, multi-dimensional scaling and link analysis. with a discrete latent variable model we identify blocs of cohesive voters within the senate, and contrast it with continuous ideal point methods. under the bloc-voting model, the senate can be interpreted as a weighted vote system, and we were able to estimate the empirical voting power of individual blocs through what-if analysis surrounding text:in contrast, the gt model presented here clusters entities into groups based on their relations to other entities. exploring the notion that the behavior of an entity can be explained by its (hidden) group membership, jakulin and buntine [***]<2> develop a discrete pca model for discovering groups. in the model each entity can belong to each of the k groups with a certain probability, and each group has its own specific pattern of behaviors. we apply our gt model also to voting data. however, unlike [***, 18]<2>, since our goal is to cluster entities based on the similarity of their voting patterns, we are only interested in whether a pair of entities voted the same or differently, not their actual yes/no votes. two resolutions on the same topic may differ only in their goal (e influence:2 type:2 pair index:663 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:420 citee title:discovering latent classes in relational data citee abstract:we present a framework for learning relational knowledge with the aim of explaining how people acquire intuitive theories of physical, biological, or social systems.
our approach is based on a generative relational model with latent classes, and simultaneously determines the kinds of entities that exist in a domain, the number of these latent classes, and the relations between classes that are possible or likely. this model goes beyond previous psychological models of category learning, which consider attributes associated with individual categories but not relationships between categories. we apply this domain-general framework to two specific problems: learning the structure of kinship systems and learning causal theories. surrounding text:social scientists have conducted extensive research on group detection, especially in fields such as anthropology [8]<3> and political science [11, 7]<3>. recently, statisticians and computer scientists have begun to develop models that specifically discover group memberships [15, 3, 17, ***]<2>. one such model is the stochastic blockstructures model [17]<2>, which discovers the latent structure, groups or classes based on pair-wise relation data. the class assignments can be inferred from a graph of observed relations or link data using gibbs sampling [17]<2>. this model is extended in [***]<2> to automatically select an arbitrary number of groups by using a chinese restaurant process prior. the aforementioned models discover latent groups only by examining whether one or more relations exist between a pair of entities influence:1 type:2 pair index:664 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:148 citee title:a pcans model of structure in organization citee abstract:we present a network based approach to characterize c2 architectures in terms of three domain elements - individuals, tasks, and resources. characterizing the possible relations among these elements results in five relational primitives - precedence, commitment of resources, assignment of individuals to tasks, networks (of relations among personnel) and skills linking individuals to resources. we demonstrate the utility of this model for recharacterizing classical organizational theory and for generating a series of testable hypotheses about c2 performance. surrounding text:8 [database management]: database applications - data mining; general terms: algorithms, experimentation; keywords: graphical models, text modeling, relational learning. 1. introduction: research in the field of social network analysis (sna) has led to the development of mathematical models that discover patterns in interaction between entities [21, 5, ***]<1>.
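a hedged note on the chinese restaurant process prior mentioned in the surrounding text above (standard construction, with symbols of my own choosing): entities are assigned to groups sequentially, and the i-th entity joins an existing group k with probability proportional to its current size n_k, or starts a new group with probability proportional to a concentration parameter \gamma,
$$ p(\text{group } k) = \frac{n_k}{i-1+\gamma}, \qquad p(\text{new group}) = \frac{\gamma}{i-1+\gamma}, $$
which is what lets the number of groups be inferred from the data rather than fixed in advance.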
one of the objectives of sna is to detect salient groups of entities influence:2 type:3 pair index:665 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:591 citee title:stochastic link and group detection citee abstract:link detection and analysis has long been important in the social sciences and in the government intelligence community. a significant effort is focused on the structural and functional analysis of "known" networks. similarly, the detection of individual links is important but is usually done with techniques that result in "known" links. more recently the internet and other sources have led to a flood of circumstantial data that provide probabilistic evidence of links. co-occurrence in news surrounding text:social scientists have conducted extensive research on group detection, especially in fields such as anthropology [8]<3> and political science [11, 7]<3>. recently, statisticians and computer scientists have begun to develop models that specifically discover group memberships [***, 3, 17, 13]<2>. one such model is the stochastic blockstructures model [17]<2>, which discovers the latent structure, groups or classes based on pair-wise relation data. once the links have been grouped, the entities connected by the links are assigned to groups. another model due to kubica et al. [***]<2> considers both link evidence and attributes on entities to discover groups. the group detection algorithm (gda) uses a bayesian network to group entities from two datasets, demographic data describing the entities and link data. unlike our model, neither of these models [3, ***]<2> considers attributes associated with the links between the entities. the model presented in [***]<2> considers attributes of an entity rather than attributes of relations between entities.
the central theme of gt is that it simultaneously clusters entities and attributes on relations (words) influence:1 type:2 pair index:666 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:592 citee title:topic and role discovery in social networks citee abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation (lda) and the author-topic (at) model, adding the key attribute that distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give surrounding text:likewise, multiple different divisions of entities into groups are made possible by conditioning them on the topics. the importance of modeling the language associated with interactions between people has recently been demonstrated in the author-recipient-topic (art) model [***]<1>. in art the words in a message between people in a network are generated conditioned on the author, recipients and a set of topics that describes the message. [13]<1> as it takes advantage of information from different modalities by conditioning group membership on topics. in this sense, the gt model draws inspiration from the role-author-recipient-topic (rart) model [***]<1>. as an extension of the art model, rart clusters together entities with similar roles influence:1 type:1 pair index:667 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s.
senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:515 citee title:estimation and prediction for stochastic blockstructures citee abstract:in this paper we extend the approach of snijders and nowicki (1997) to the case where relations can be directed and can have an arbitrary set of possible values, and where the number of classes is arbitrary. the vector of attributes x specifying the class structure is considered to be unobserved (latent). conditional on the vector x, we model y using a generalization of the pair-dependent stochastic blockmodel. the model can be regarded as a mixture model because the classes to which the vertices surrounding text:social scientists have conducted extensive research on group detection, especially in fields such as anthropology [8]<3> and political science [11, 7]<3>. recently, statisticians and computer scientists have begun to develop models that specifically discover group memberships [15, 3, ***, 13]<2>. one such model is the stochastic blockstructures model [***]<2>, which discovers the latent structure, groups or classes based on pair-wise relation data. recently, statisticians and computer scientists have begun to develop models that specifically discover group memberships [15, 3, ***, 13]<2>. one such model is the stochastic blockstructures model [***]<2>, which discovers the latent structure, groups or classes based on pair-wise relation data. a particular relation holds between a pair of entities (people, countries, organizations, etc.). the relations between all the entities can be represented with a directed or undirected graph. the class assignments can be inferred from a graph of observed relations or link data using gibbs sampling [***]<2>. this model is extended in [13]<2> to automatically select an arbitrary number of groups by using a chinese restaurant process prior. our notation is summarized in table 1, and the graphical model representation of the model is shown in figure 1. without considering the topic of an event, or by treating all events in a corpus as reflecting a single topic, the simplified model (only the right part of figure 1) becomes equivalent to the stochastic blockstructures model [***]<2>. to match the blockstructures model, each event defines a relationship, e. note that we adopt conjugate priors in our setting, and thus we can easily integrate out the multinomial parameters to decrease the uncertainty associated with them. this simplifies the sampling since we do not need to sample these parameters at all, unlike in [***]<2>. in our case we need to compute the conditional distribution p(g_st | w, v, g_-st, ...). experimental results: we present experiments applying the gt model to the voting records of members of two legislative bodies: the us senate and the un general assembly. for comparison, we present the results of a baseline method that first uses a mixture of unigrams to discover topics and associate a topic with each resolution, and then runs the blockstructures model [***]<2> separately on the resolutions assigned to each topic.
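a hedged sketch of the collapsed gibbs step alluded to in the surrounding text above (generic form with my own symbols; the exact gt conditional is model-specific and not given in this excerpt): when a multinomial parameter with a dirichlet(\alpha) prior is integrated out, the predictive probability of a discrete assignment reduces to a ratio of counts plus pseudocounts, e.g.
$$ p(g_{st}=k \mid \text{rest}) \;\propto\; \frac{n^{\neg st}_{tk} + \alpha}{\sum_{k'} \big( n^{\neg st}_{tk'} + \alpha \big)} \;\times\; p(\text{votes of entity } s \mid g_{st}=k, \text{other assignments}), $$
where n^{\neg st}_{tk} counts how often group k is used under topic t excluding the current assignment, and the second factor likewise becomes a product of count ratios once the vote parameters are integrated out.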
this baseline approach is similar to the gt model in that it discovers both groups and topics, and has different group assignments on different topics influence:1 type:2 pair index:668 citer id:586 citer title:group and topic discovery from relations and text citer abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time. here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice-versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or blockstructures for votes, the group-topic model's joint inference improves both the groups and topics discovered. categories and subject descriptors citee id:593 citee title:parliamentary group and individual voting behavior in finnish parliament in year 2003: a group cohesion and voting similarity analysis citee abstract:although group cohesion studies are rather common elsewhere, the last analyses of the finnish parliament eduskunta were published in the 1960s. this article provides, firstly, a fresh group cohesion analysis using the agreement index, which is a modified version of the classic rice index. secondly, two advanced voting similarity analyses, together with a new easy-to-understand way of illustrating the results, are provided. where the agreement index operates at the parliamentary party group level, the voting similarity analyses are able to analyse and illustrate the individual mp level. the article is partly methodological in testing the voting similarity methods, however, it also provides insight into the recent voting behaviour within eduskunta. surrounding text:they apply this model to voting data in the 108th us senate where the behavior of an entity is its vote on a resolution. a similar model is developed in [***]<2> that examines group cohesion and voting similarity in the finnish parliament. we apply our gt model also to voting data. however, unlike [12, ***]<2>, since our goal is to cluster entities based on the similarity of their voting patterns, we are only interested in whether a pair of entities voted the same or differently, not their actual yes/no votes. two resolutions on the same topic may differ only in their goal (e influence:2 type:2 pair index:669 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability.
we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:597 citee title:hidden markov models in molecular biology: new algorithms and applications citee abstract:hidden markov models (hmms) can be applied to several important problems in molecular biology. we introduce a new convergent learning algorithm for hmms that, unlike the classical baum-welch algorithm, is smooth and can be applied on-line or in batch mode, with or without the usual viterbi most likely path approximation. left-right hmms with insertion and deletion states are then trained to represent several protein families including immunoglobulins and kinases. in all cases, the models surrounding text:classification and alignment (haussler, krogh, mian & sjölander, 1992; baldi, chauvin, hunkapiller & mcclure, 1993)[***]<3>[6]<3>. practitioners have typically chosen the hmm topology by hand, so that learning the hmm from sample data means estimating only a influence:2 type:3 pair index:670 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability. we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:598 citee title:statistical decision theory and bayesian analysis citee abstract:"the outstanding strengths of the book are its topic coverage, references, exposition, examples and problem sets... this book is an excellent addition to any mathematical statistician's library." - bulletin of the american mathematical society. in this new edition the author has added substantial material on bayesian analysis, including lengthy new sections on such important topics as empirical and hierarchical bayes analysis, bayesian calculation, bayesian communication, and group decision making. with these changes, the book can be used as a self-contained introduction to bayesian analysis. in addition, much of the decision-theoretic portion of the text was updated, including new sections covering such modern topics as minimax multivariate (stein) estimation. surrounding text:fitting the model parameters. since both the transition and the emission probabilities are given by multinomial distributions it is natural to use a dirichlet conjugate prior in this case (berger, 1985)[***]<1>. the effect of this prior is equivalent to having a number of virtual samples for each of the possible transitions and emissions which are added to the actual samples when it comes to estimating the most likely parameter settings influence:2 type:1 pair index:671 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states.
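a short illustration of the "virtual samples" reading of the dirichlet prior described above (standard result, symbols of my own choosing): with c_{ij} observed transitions from state i to state j and dirichlet pseudocounts \alpha_{ij}, the estimated transition probability becomes
$$ \hat{a}_{ij} \;=\; \frac{c_{ij} + \alpha_{ij}}{\sum_{k} \big( c_{ik} + \alpha_{ik} \big)}, $$
and the emission probabilities are smoothed the same way, so the prior simply adds \alpha_{ij} imaginary observations to each count before normalizing.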
both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability. we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:599 citee title:protein modeling using hidden markov models: analysis of globins citee abstract:we apply hidden markov models (hmms) to the problem of statistical modeling and multiple alignment of protein families. a variant of the expectation maximization (em) algorithm known as the viterbi algorithm is used to obtain the statistical model from the unaligned sequences. in a detailed series of experiments, we have taken 400 unaligned globin sequences, and produced a statistical model entirely automatically from the primary (unaligned) sequences using no prior knowledge of globin surrounding text:classification and alignment (haussler, krogh, mian & sjölander, 1992; baldi, chauvin, hunkapiller & mcclure, 1993)[2]<3>[***]<3>. practitioners have typically chosen the hmm topology by hand, so that learning the hmm from sample data means estimating only a influence:2 type:3 pair index:672 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability. we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:183 citee title:a study of grammatical inference citee abstract:grammatical inference is an inductive process of discovering an acceptable grammar for a language, on the basis of finite samples from the language. the study has the goals of devising useful inference procedures and of demonstrating a sound formal basis for such procedures. it states the general grammatical inference problem for formal languages, reviews previous work, establishes definitions and notation, and states a position on evaluation measures. it indicates a solution for a particular class of grammatical inference problems, based on an assumed probabilistic structure surrounding text:finite-state model space search guided by a (non-probabilistic) goodness measure. horning (1969)[***]<2> describes a bayesian grammar induction procedure that searches the model space exhaustively for the map model. the procedure provably influence:3 type:2 pair index:673 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability.
we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:600 citee title:the estimation of stochastic context-free grammars using the inside-outside algorithm citee abstract:a combination of a transformation algorithm between some stochastic context-free grammars and the inside-outside algorithm allows us to define a method for the estimation of the rule probabilities of stochastic context-free grammars with the same time complexity as the inside-outside algorithm. the transformation algorithm relates stochastic context-free grammars, whose characteristic grammar is proper and does not have single rules, to stochastic context-free grammars in chomsky normal form surrounding text:this is currently being carried out in collaboration with chuck wooters at icsi, and it appears that our merging algorithm is able to produce linguistically adequate phonetic models. another direction involves an extension of the model space to stochastic context-free grammars, for which a standard estimation method analogous to baum-welch exists (lari & young, 1990)[***]<3>. the notions of sample incorporation and merging carry over to this domain (with merging now involving the non-terminals of the cfg), but need to be complemented with a mechanism that adds new non-terminals to create hierarchical structure (which we call chunking) influence:3 type:3 pair index:674 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability.
the connectionist version requires much more structure than the usual models and has been implemented using the rochester connectionist simulator. we also show that no machine with finite working storage can iteratively identify the fsl from arbitrary presentations. surrounding text:the incremental augmentation of the hmm by merging in new samples has some of the flavor of the algorithm used by porat & feldman (1991)[***]<2> to induce a finite-state model from positive-only, ordered examples influence:3 type:3 pair index:675 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability. we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:227 citee title:an introduction to hidden markov models citee abstract:the basic theory of markov chains has been known to mathematicians and engineers for close to 80 years, but it is only in the past decade that it has been applied explicitly to problems in speech processing. one of the major reasons why speech models, based on markov chains, have not been developed until recently was the lack of a method for optimizing the parameters of the markov model to match observed signal patterns. such a method was proposed in the late 1960's and was immediately applied to speech processing in several research institutions. continued refinements in the theory and implementation of markov modelling techniques have greatly enhanced the method, leading to a wide range of applications of these models. it is the purpose of this tutorial paper to give an introduction to the theory of markov models, and to illustrate how they have been applied to problems in speech recognition. surrounding text:finite-state automata, where both the transitions between states and the generation of output symbols are governed by probability distributions. hmms have been important in speech recognition (rabiner & juang, 1986)[***]<1>, cryptography, and more recently in other areas such as protein classification and alignment (haussler, krogh, mian & sjölander, 1992; baldi, chauvin, hunkapiller & mcclure, 1993). 2 hidden markov models: for lack of space we cannot give a full introduction to hmms here. see rabiner & juang (1986)[***]<1> for details. brie influence:1 type:1 pair index:676 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability.
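since several of the hmm records above lean on the forward-probability computation without writing it out, here is a minimal, self-contained sketch of the scaled forward algorithm for a discrete-output hmm (toy parameters of my own choosing, not code from any cited paper):

import numpy as np

def hmm_log_likelihood(pi, A, B, obs):
    # Scaled forward algorithm for a discrete-output HMM (Rabiner-style scaling).
    # pi: (N,) initial state probs; A: (N, N) transitions, A[i, j] = P(j | i);
    # B: (N, M) emissions, B[i, k] = P(symbol k | state i); obs: list of ints in [0, M).
    alpha = pi * B[:, obs[0]]              # unnormalized alpha_1(i) = pi_i * b_i(o_1)
    c = alpha.sum()                        # scaling factor keeps alpha from underflowing
    alpha, log_lik = alpha / c, np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # alpha_t(j) = sum_i alpha_{t-1}(i) A[i, j] * b_j(o_t)
        c = alpha.sum()
        alpha, log_lik = alpha / c, log_lik + np.log(c)
    return log_lik                         # log P(obs | model) = sum_t log c_t

# toy usage: two hidden states, three output symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(hmm_log_likelihood(pi, A, B, [0, 1, 2, 2]))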
we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:191 citee title:a universal prior for integers and estimation by minimum description length citee abstract:an earlier introduced estimation principle, which calls for minimization of the number of bits required to write down the observed data, has been reformulated to extend the classical maximum likelihood principle. the principle permits estimation of the number of the parameters in statistical models in addition to their values and even of the way the parameters appear in the models; i.e., of the model structures. the principle rests on a new way to interpret and construct a universal prior distribution for the integers, which makes sense even when the parameter is an individual object. truncated real-valued parameters are converted to integers by dividing them by their precision, and their prior is determined from the universal prior for the integers by optimizing the precision. surrounding text:m , and can be viewed as a description length prior that penalizes models according to their coding length (rissanen, 1983; wallace & freeman, 1987)[***]<3>[16]<3>. the constants in this mdl term had to be adjusted by hand from examples of desirable generalization influence:3 type:3 pair index:677 citer id:596 citer title:hidden markov model induction by bayesian model merging citer abstract:this paper describes a technique for learning both the number of states and the topology of hidden markov models from examples. the induction process starts with the most specific model consistent with the training data and generalizes by successively merging states. both the choice of states to merge and the stopping criterion are guided by the bayesian posterior probability. we compare our algorithm with the baum-welch method of estimating fixed-size models, and find that it can induce minimal hmms from data in cases where fixed estimation does not converge or requires redundant parameters to converge citee id:514 citee title:estimation and inference by compact coding citee abstract:the systematic variation within a set of data, as represented by a usual statistical model, may be used to encode the data in a more compact form than would be possible if they were considered to be purely random. the encoded form has two parts. the first states the inferred estimates of the unknown parameters in the model, the second states the data using an optimal code based on the data probability distribution implied by those parameter estimates. choosing the model and the estimates that give the most compact coding leads to an interesting general inference procedure. in its strict form it has great generality and several nice properties but is computationally infeasible. an approximate form is developed and its relation to other methods is explored surrounding text:m , and can be viewed as a description length prior that penalizes models according to their coding length (rissanen, 1983; wallace & freeman, 1987)[13]<3>[***]<3>.
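a hedged gloss on the description-length prior mentioned above (generic mdl reading, my notation; not a formula quoted from the paper): taking p(m) \propto \exp(-\ell(m)), where \ell(m) is the coding length of model m in nats, the log posterior becomes
$$ \log p(m \mid x) \;=\; \log p(x \mid m) \;-\; \ell(m) \;+\; \text{const}, $$
so more complex models are penalized exactly by their coding length.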
the constants in this mdl term had to be adjusted by hand from examples of desirable generalization influence:3 type:3 pair index:678 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:614 citee title:motor imagery and direct brain-computer communication citee abstract:motor imagery can modify the neuronal activity in the primary sensorimotor areas in a very similar way as observable with a real executed movement. one part of eeg-based brain-computer interfaces (bci) is based on the recording and classification of circumscribed and transient eeg changes during different types of motor imagery such as, e.g., imagination of left-hand, right-hand, or foot movement. features such as, e.g., band power or adaptive autoregressive parameters are either extracted in bipolar eeg recordings overlaying sensorimotor areas or from an array of electrodes located over central and neighboring areas. for the classification of the features, linear discrimination analysis and neural networks are used. characteristic for the graz bci is that a classifier is set up in a learning session and updated after one or more sessions with online feedback using the procedure of rapid prototyping. as a result, a discrimination of two brain states (e.g., left- versus right-hand movement imagination) can be reached within only a few days of training. at this time, a tetraplegic patient is able to operate an eeg-based control of a hand orthosis with nearly 100% classification accuracy by mental imagination of specific motor commands surrounding text:introduction: classification of eeg is an important part of eeg-based brain-computer interfaces. an overview of eeg-based brain-computer interface systems is presented by pfurtscheller and neuper [***]<0>. they summarize several approaches, such as linear discrimination analysis (lda), artificial neural networks (ann) and hmms, for classifying features extracted from raw eeg data influence:2 type:3 pair index:679 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:615 citee title:using time-dependent neural networks for eeg classification citee abstract:this paper compares two different topologies of neural networks. they are used to classify single trial electroencephalograph (eeg) data from a brain-computer interface (bci). a short introduction to time series classification is given, and the used classifiers are described. standard multilayer perceptrons (mlps) are used as a standard method for classification.
they are compared to finite impulse response (fir) mlps, which use fir filters instead of static weights to allow temporal processing inside the classifier. a theoretical comparison of the two architectures is presented. the results of a bci experiment with three different subjects are given and discussed. these results demonstrate the higher performance of the fir mlp compared with the standard mlp. surrounding text:when a neural network is used for eeg analysis, it is often modified to exploit time information. for example, haselsteiner and pfurtscheller [***]<2> use a time-delayed neural network and collect features using an adaptive autoregressive (ar) model. hmms have been heavily researched and used for the past several decades, especially in the speech recognition area [3]<1>, and successfully applied to a wide variety of applications, including eeg classification [4]<1>, [5]<1> influence:1 type:2 pair index:680 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:189 citee title:a tutorial on hidden markov models and selected applications in speech recognition citee abstract:this tutorial provides an overview of the basic theory of hidden markov models (hmms) as originated by l.e. baum and t. petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. the author first reviews the theory of discrete markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. the theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. three fundamental problems of hmms are noted and several practical techniques for solving these problems are given. the various types of hmms that have been studied, including ergodic as well as left-right models, are described surrounding text:for example, haselsteiner and pfurtscheller [2]<2> use a time-delayed neural network and collect features using an adaptive autoregressive (ar) model. hmms have been heavily researched and used for the past several decades, especially in the speech recognition area [***]<1>, and successfully applied to a wide variety of applications, including eeg classification [4]<1>, [5]<1>. works on eeg classification usually apply hmms to the timechanging feature vectors extracted by an ar model or by some other digital signal processing techniques influence:1 type:1 pair index:681 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. 
it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:189 citee title:a tutorial on hidden markov models and selected applications in speech recognition citee abstract:this tutorial provides an overview of the basic theory of hidden markov models (hmms) as originated by l.e. baum and t. petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. the author first reviews the theory of discrete markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. the theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. three fundamental problems of hmms are noted and several practical techniques for solving these problems are given. the various types of hmms that have been studied, including ergodic as well as left-right models, are described surrounding text:for example, haselsteiner and pfurtscheller [2]<2> use a time-delayed neural network and collect features using an adaptive autoregressive (ar) model. hmms have been heavily researched and used for the past several decades, especially in the speech recognition area [***]<1>, and successfully applied to a wide variety of applications, including eeg classification [4]<1>, [5]<1>. works on eeg classification usually apply hmms to the time-changing feature vectors extracted by an ar model or by some other digital signal processing techniques influence:1 type:1 pair index:681 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:458 citee title:eeg pattern recognition: arousal states detection and classification citee abstract:we use an electroencephalogram (eeg) to detect the arousal states of humans. the eeg patterns fluctuating between waking and sleep are described as several features extracted from the moving time windows. the significant feature, mean frequency (mf), is used for arousal states detection. the fluctuations of the eeg mean frequency are characterized as hidden markov models (hmms), which well estimate the next possible arousal state. in this hmm, single current-to-next state transition probability is considered along with global states transitions. both local contextual effects (lce) and global contextual effects (gce) autoregressive hmms are used to estimate and detect the transitions from waking to sleep. the validity of our proposed model is verified via a behavior measure - the correct rate of the subject's responses to auditory stimuli. in our study, the estimated values of mean frequency by gce hmm show high correlation with the behavior measure. this high recognition rate makes arousal states detection practicable. furthermore, the lce hmms are also applicable for artifacts rejection. in real-time arousal states detection, alarms are given whenever the subject's vigilant states turn into drowsy states. this system has many applications, for example, it can be used to maintain long term vehicle driver's arousal surrounding text:for example, haselsteiner and pfurtscheller [2]<2> use a time-delayed neural network and collect features using an adaptive autoregressive (ar) model. hmms have been heavily researched and used for the past several decades, especially in the speech recognition area [3]<1>, and successfully applied to a wide variety of applications, including eeg classification [***]<1>, [5]<1>. works on eeg classification usually apply hmms to the time-changing feature vectors extracted by an ar model or by some other digital signal processing techniques. huang et al. [***]<1> use the mean frequency features, calculated from fft spectrum, for detecting the arousal state changes. obermaier, guger, and pfurtscheller [5]<1> compare lda and hmms on bandpass-filtered feature vectors and experiment with the structure parameters of hmms influence:1 type:1 pair index:682 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:612 citee title:hmm used for the offline classification of eeg data citee abstract:hidden markov models (hmm) are introduced for the offline classification of single-trial eeg data in a brain-computer interface (bci).
the hmms are used to classify hjorth parameters calculated from bipolar eeg data, recorded during the imagination of a left or right hand movement. the effects of different types of hmms on the recognition rate are discussed. furthermore, a comparison of the results achieved with the linear discriminant (ld) and the hmm is presented. surrounding text:for example, haselsteiner and pfurtscheller [2]<2> use a time-delayed neural network and collect features using an adaptive autoregressive (ar) model. hmms have been heavily researched and used for the past several decades, especially in the speech recognition area [3]<1>, and successfully applied to a wide variety of applications, including eeg classification [4]<1>, [***]<1>. works on eeg classification usually apply hmms to the time-changing feature vectors extracted by an ar model or by some other digital signal processing techniques. huang et al. [4]<1> use the mean frequency features, calculated from fft spectrum, for detecting the arousal state changes. obermaier, guger, and pfurtscheller [***]<1> compare lda and hmms on bandpass-filtered feature vectors and experiment with the structure parameters of hmms. penny and roberts [6]<1> conclude, based on experiments on synthetic data, that hmms are capable of detecting nonstationary changes and are thus perfect for eeg analysis influence:1 type:2 pair index:683 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:580 citee title:gaussian observation hidden markov models for eeg analysis citee abstract:we show, using a number of synthetic data sets, that hmms can detect the changes in dc levels, correlation, frequency and coherence that are typical of the nonstationary changes in an eeg signal. we also show that the number of hidden states in an hmm can be chosen using a cluster analysis of derived features. our experiments are based on the use of hmms with gaussian observation densities trained on autoregressive (ar) or multivariate autoregressive (mar) coefficients. the extraction of these surrounding text:obermaier, guger, and pfurtscheller [5]<1> compare lda and hmms on bandpass-filtered feature vectors and experiment with the structure parameters of hmms. penny and roberts [***]<1> conclude, based on experiments on synthetic data, that hmms are capable of detecting nonstationary changes and are thus perfect for eeg analysis. they point out that operating hmms on ar coefficients is fundamentally flawed because the windowing procedure used in ar models may lead to incorrect estimates of state and state transitions in an hmm model influence:1 type:2 pair index:684 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem.
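as a concrete illustration of the classification scheme these records describe (fit one gaussian-observation hmm per imagery class on per-trial feature sequences, then label a new trial by the model with the higher likelihood), here is a minimal sketch; it uses synthetic data and assumes the third-party hmmlearn package, neither of which comes from the cited papers:

import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumed third-party dependency (pip install hmmlearn)

rng = np.random.default_rng(0)

def synth_trials(offset, n_trials=20, T=50, d=2):
    # Synthetic stand-in for per-trial feature sequences (e.g., Hjorth or AR features).
    return [offset + 0.1 * rng.standard_normal((T, d)).cumsum(axis=0) for _ in range(n_trials)]

def fit_class_hmm(trials, n_states=3):
    # Concatenate the trials and tell the HMM where each sequence ends via `lengths`.
    X = np.vstack(trials)
    lengths = [len(t) for t in trials]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

left_trials, right_trials = synth_trials(0.0), synth_trials(1.0)
m_left, m_right = fit_class_hmm(left_trials), fit_class_hmm(right_trials)

# Classify a held-out trial by whichever class model assigns the higher log-likelihood.
trial = synth_trials(1.0, n_trials=1)[0]
print("left" if m_left.score(trial) > m_right.score(trial) else "right")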
the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:388 citee title:coupled hidden markov models for modeling interactive processes citee abstract:we present methods for coupling hidden markov models (hmms) to model systems of multiple interacting processes. the resulting models have multiple state variables that are temporally coupled via matrices of conditional probabilities. we introduce a deterministic o(t(cn)^2) approximation for maximum a posteriori (map) state estimation which enables fast classification and parameter estimation via expectation maximization. an "n-heads" dynamic programming algorithm samples from the highest surrounding text:using one univariate hmm for each channel and then combining these hmms is another one. recently, chmm models have been proposed to better model multiple interacting time series processes [***]<2>, [8]<2> and they seem to work better than hmms. also some generalized hmm models have been suggested to enrich the hmm model for specific applications [9]<2>, [10]<2>. em algorithm) for standard hmm models. several typical examples from recent literature are chmms [***]<2>, event-coupled hmms [15]<2>, factorial hmms (fhmms) [10]<2> and input-output hmms (iohmms) [9]<2>, as shown in fig. 2. there have been several variations of the standard chmm for which the model size and inference problems are more tractable. coupled hmms proposed by brand [***]<2> is one of them. in his paper, brand substitutes the joint conditional probability by the product of all marginal conditional probabilities, i influence:1 type:1 pair index:685 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:516 citee title:estimation of coupled hidden markov models with application to biosignal interaction modelling citee abstract:coupled hidden markov models (chmm) are a new tool which model interactions in state space rather than observation space. thus they may reveal coupling where classical tools such as correlation fail. in this paper we derive the maximum likelihood equations for the chmm parameters using the expectation maximisation algorithm. the use of the models is demonstrated in simulated data, as well as in biomedical signal analysis. surrounding text:using one univariate hmm for each channel and then combining these hmms is another one. recently, chmm models have been proposed to better model multiple interacting time series processes [7]<2>, [***]<2> and they seem to work better than hmms. also some generalized hmm models have been suggested to enrich the hmm model for specific applications [9]<2>, [10]<2>. p(s_t^c | s_{t-1}^1, ..., s_{t-1}^C) = \prod_{c'} p(s_t^c | s_{t-1}^{c'}) (3) this formulation is erroneous since the right hand side is not a properly defined probability density (does not sum up to one). rezek and roberts [***]<2> use a decoupled forward variable for each hmm chain in a chmm, that is an approximation of the true forward variables.
the computational complexity is reduced but still exponential in the number of hmm chains influence:2 type:1 pair index:686 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:616 citee title:input-output hmms for sequence processing citee abstract:we consider problems of sequence processing and propose a solution based on a discrete-state model in order to represent past context. we introduce a recurrent connectionist architecture having a modular structure that associates a subnetwork to each state. the model has a statistical interpretation we call input-output hidden markov model (iohmm). it can be trained by the estimation-maximization (em) or generalized em (gem) algorithms, considering state trajectories as missing data, which decouples temporal credit assignment and actual parameter estimation. the model presents similarities to hidden markov models (hmms), but allows us to map input sequences to output sequences, using the same processing style as recurrent neural networks. iohmms are trained using a more discriminant learning paradigm than hmms, while potentially taking advantage of the em algorithm. we demonstrate that iohmms are well suited for solving grammatical inference problems on a benchmark problem. experimental results are presented for the seven tomita grammars, showing that these adaptive models can attain excellent generalization surrounding text:recently, chmm models have been proposed to better model multiple interacting time series processes [7]<2>, [8]<2> and they seem to work better than hmms. also some generalized hmm models have been suggested to enrich the hmm model for specific applications [***]<2>, [10]<2>. in this paper, we propose a new chmm formulation, named distance coupled hmm (dchmm), that is related to the mixed memory markov models [11]<1>. em algorithm) for standard hmm models. several typical examples from recent literature are chmms [7]<2>, event-coupled hmms [15]<2>, factorial hmms (fhmms) [10]<2> and input-output hmms (iohmms) [***]<2>, as shown in fig. 2. 2(b) is a specific type of coupled hmms, developed by kristjansson, frey, and huang [15]<2>, for modeling a class of loosely coupled time series where only the onsets of events are coupled in time. bengio and frasconi [***]<2> develop the iohmms (fig. 2(d)) to address the input-output sequence pair modeling problem. different. a certain independence assumption of inputs does not apply to the hidden states and the em algorithm used in [***]<2> cannot be used for general chmms. the fhmm shown in fig influence:3 type:2 pair index:687 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. 
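a short note on the complexity remark at the start of this line: if the C coupled chains (N states each) are treated exactly as one hmm over the joint state space, the standard forward recursion applies to N^C joint states, which is what makes exact inference exponential in the number of chains. this is standard hmm material, stated here as an editorial sketch rather than taken from any single cited paper.

```latex
% Joint-state forward variable for C coupled chains with N states each:
% the joint state s_t = (s_t^{(1)}, \dots, s_t^{(C)}) ranges over N^C values.
\alpha_t(s_t) \;=\; p(o_1,\dots,o_t,\, s_t)
  \;=\; \Big[\sum_{s_{t-1}} \alpha_{t-1}(s_{t-1})\, p(s_t \mid s_{t-1})\Big]\, p(o_t \mid s_t),
\qquad \text{cost per sequence: } O\!\left(T\, N^{2C}\right).
```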
the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:549 citee title:factorial hidden markov models citee abstract:we present a framework for learning in hidden markov models with distributed state representations. within this framework, we derive a learning algorithm based on the expectation-maximization (em) procedure for maximum likelihood estimation. analogous to the standard baum-welch update rules, the m-step of our algorithm is exact and can be solved analytically. however, due to the combinatorial nature of the hidden state representation, the exact e-step is intractable. a simple and tractable mean field approximation is derived. empirical results on a set of problems suggest that both the mean field approximation and gibbs sampling are viable alternatives to the computationally expensive exact algorithm. surrounding text:recently, chmm models have been proposed to better model multiple interacting time series processes [7]<2>, [8]<2> and they seem to work better than hmms. also some generalized hmm models have been suggested to enrich the hmm model for specific applications [9]<2>, [***]<2>. in this paper, we propose a new chmm formulation, named distance coupled hmm (dchmm), that is related to the mixed memory markov models [11]<1>. em algorithm) for standard hmm models. several typical examples from recent literature are chmms [7]<2>, event-coupled hmms [15]<2>, factorial hmms (fhmms) [***]<2> and input-output hmms (iohmms) [9]<2>, as shown in fig. 2 influence:3 type:2 pair index:688 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:617 citee title:mixed memory markov models: decomposing complex stochastic processes as mixtures of simpler ones citee abstract:we study markov models whose state spaces arise from the cartesian product of two or more discrete random variables. we show how to parameterize the transition matrices of these models as a convex combination (or mixture) of simpler dynamical models. the parameters in these models admit a simple probabilistic interpretation and can be fitted iteratively by an expectation-maximization (em) procedure. we derive a set of generalized baum-welch updates for factorial hidden markov models that surrounding text:also some generalized hmm models have been suggested to enrich the hmm model for specific applications [9]<2>, [10]<2>. in this paper, we propose a new chmm formulation, named distance coupled hmm (dchmm), that is related to the mixed memory markov models [***]<1>. we examine some of these sophisticated models on the eeg classification problem and compare their performance against the simple aforementioned hmm models. c. learning a dchmm the em algorithm can be derived for learning a dchmm, as shown by saul and jordan [***]<1>.
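for orientation on the mixed memory markov models of saul and jordan referred to just above: their key idea is to parameterize the coupled transition as a convex combination of pairwise transition matrices. the notation below is this editor's sketch of that idea, not necessarily the notation of the cited papers or of the dchmm.

```latex
% Mixed-memory style parameterization (sketch of the idea): the transition of
% chain c depends on the previous states of all chains only through a mixture
% of pairwise transition matrices a^{(c,c')}, with weights psi^{(c)}(c') >= 0,
% sum_{c'} psi^{(c)}(c') = 1.
p\big(s_t^{(c)} \mid s_{t-1}^{(1)},\dots,s_{t-1}^{(C)}\big)
  \;=\; \sum_{c'=1}^{C} \psi^{(c)}(c')\; a^{(c,c')}\big(s_t^{(c)} \mid s_{t-1}^{(c')}\big).
```

because the right hand side is a convex combination of conditional distributions, it is properly normalized, in contrast to the product-of-marginals form criticized earlier, and the parameter count grows roughly as C^2 N^2 rather than exponentially in C.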
but the algorithm finally amounts to using statistics that are similar to forward variables but need factored approximation influence:1 type:1 pair index:689 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:387 citee title:coupled hidden markov models for complex action recognition citee abstract:we present algorithms for coupling and training hidden markov models (hmms) to model interacting processes, and demonstrate their superiority to conventional hmms in a vision task classifying two-handed actions. hmms are perhaps the most successful framework in perceptual computing for modeling and classifying dynamic behaviors, popular because they offer dynamic time warping, a training algorithm and a clear bayesian semantics. however the markovian framework makes strong restrictive assumptions about the system generating the signal-that it is a single process having a small number of states and an extremely limited state memory. the single-process model is often inappropriate for vision (and speech) applications, resulting in low ceilings on model performance. coupled hmms provide an efficient way to resolve many of these problems, and offer superior training speeds, model likelihoods, and robustness to initial conditions. surrounding text:b. coupled hmms various extended hmm models have been used to solve coupled sequence data analysis problems, such as complex human action recognition [***]<3>, traffic modeling [14]<3> and biosignal analysis [8]<3>. these new models aim to enhance the capabilities of standard hmm model by using more complex architectures, while still being able to utilize the established methodologies (e influence:3 type:2 pair index:690 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:618 citee title:modeling freeway traffic with coupled hmms citee abstract:we consider the problem of modeling loop detector data collected from freeways. the data, which is a vector time-series, contains the speed of vehicles, averaged over a 30 second sampling window, at a number of sites along the freeway. we assume the measured speed at each location is generated from a hidden discrete variable, which represents the underlying state (e.g., congested or free-flowing) of the traffic at that point in space and time. we further assume that the hidden variables surrounding text:b.
coupled hmms various extended hmm models have been used to solve coupled sequence data analysis problems, such as complex human action recognition [13]<3>, traffic modeling [***]<3> and biosignal analysis [8]<3>. these new models aim to enhance the capabilities of standard hmm model by using more complex architectures, while still being able to utilize the established methodologies (e influence:3 type:2 pair index:691 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:530 citee title:event-coupled hidden markov models citee abstract:inferences from time-series data can be greatly enhanced by taking into account multiple modalities. in some cases, such as audio of speech and the corresponding video of lip gestures, the different time-series are tightly coupled. we are interested in loosely-coupled time series where only the onset of events are coupled in time. we present an extension of the forward-backward algorithm that can be used for inference and learning in event-coupled hidden markov models and give results on a surrounding text:em algorithm) for standard hmm models. several typical examples from recent literature are chmms [7]<2>, event-coupled hmms [***]<2>, factorial hmms (fhmms) [10]<2> and input-output hmms (iohmms) [9]<2>, as shown in fig. 2. fig. 2(b) is a specific type of coupled hmms, developed by kristjansson, frey, and huang [***]<2>, for modeling a class of loosely coupled time series where only the onsets of events are coupled in time. bengio and frasconi [9]<2> develop the iohmms (fig influence:3 type:1 pair index:692 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:238 citee title:approximate learning of dynamic models citee abstract:inference is a key component in learning probabilistic models from partially observable data. when learning temporal models, each of the many inference phases requires a complete traversal over a potentially very long sequence; furthermore, the data structures propagated in this procedure can be extremely large, making the whole process very demanding. in , we describe an approximate inference algorithm for monitoring stochastic processes, and prove bounds on its approximation error. in surrounding text:kwon and murphy [14]<2> use chmm to model freeway traffic. they cast chmm in a more general framework called dynamic bayesian network (dbn) in which the approximate inference can be done using boyen-koller (bk) algorithm [***]<3>.
murphy and weiss [17]<2> examine a factored version of the bk algorithm that has a complexity of o(tcn^(f+1)), where t is the length of sequence and f the maximum fan-in of any node influence:3 type:1 pair index:693 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:619 citee title:the factored frontier algorithm for approximate inference in dbns citee abstract:the factored frontier (ff) algorithm is a simple approximate inference algorithm for dynamic bayesian networks (dbns). it is very similar to the fully factorized version of the boyen-koller (bk) algorithm, but instead of doing an exact update at every step followed by marginalisation (projection), it always works with factored distributions. hence it can be applied to models for which the exact update step is intractable. we show that ff is equivalent to (one iteration of) loopy belief... surrounding text:they cast chmm in a more general framework called dynamic bayesian network (dbn) in which the approximate inference can be done using boyen-koller (bk) algorithm [16]<3>. murphy and weiss [***]<2> examine a factored version of the bk algorithm that has a complexity of o(tcn^(f+1)), where t is the length of sequence and f the maximum fan-in of any node. saul and jordan [11]<2> reduce the number of parameters in eq influence:3 type:1 pair index:694 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:594 citee title:growth transformations for functions on manifolds citee abstract:in this paper we look at the problem of maximizing a function p defined on a manifold m. although we shall be primarily concerned with the case where m is a certain polyhedron in a euclidean space r^n and p is a polynomial with nonnegative coefficients defined on r^n, some of our results are valid in greater generality. surrounding text:the transformation can be applied to more general likelihood functions that are polynomials with positive coefficients (not necessarily homogeneous), according to a relaxation presented by baum and sell [***]<1>. one advantage of this learning algorithm is that it can be applied to maximize more complex objective functions for which the em algorithm may be difficult to derive influence:3 type:1 pair index:695 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm.
it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:620 citee title:maximum likelihood estimation for multivariate mixture observations of markov chains citee abstract:to use probabilistic functions of a markov chain to model certain parameterizations of the speech signal, we extend an estimation technique of liporace to the cases of multivariate mixtures, such as gaussian sums, and products of mixtures. we also show how these problems relate to liporace's original framework. surrounding text:experimental setting we want to mention a few details of the training of hmms here. juang, levinson, and sondhi [***]<3> point out that using mixture of gaussians as the observation model of hmm sometimes results in singularity problems during training. they suggest solving the problem by re-training from a different initialization influence:3 type:3 pair index:696 citer id:613 citer title:hmms and coupled hmms for multi-channel eeg classification citer abstract:a variety of coupled hmms (chmms) have recently been proposed as extensions of hmm to better characterize multiple interdependent sequences. this paper introduces a novel distance coupled hmm. it then compares the performance of several hmm and chmm models for a multi-channel eeg classification problem. the results show that, of all approaches examined, the multivariate hmm that has low computational complexity surprisingly outperforms all other models citee id:302 citee title:clustering sequences with hidden markov models citee abstract:this paper discusses a probabilistic model-based approach to clustering sequences, using hidden markov models (hmms). the problem can be framed as a generalization of the standard mixture model approach to clustering in feature space. two primary issues are addressed. first, a novel parameter initialization procedure is proposed, and second, the more difficult problem of determining the number of clusters k, from the data, is investigated. experimental results indicate that the proposed surrounding text:rabiner et al [23]<3> find, through empirical study, that accurately estimating the means of gaussians is critical to learning good models for continuous hmms. in his experiment, smyth [***]<3> uses a clever initialization scheme using the k-means algorithm to locate the means. we adopt the same strategy influence:3 type:1 pair index:697 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y.
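the k-means initialization scheme mentioned in the surrounding text above (locating the gaussian state means with k-means before baum-welch refinement) can be sketched as follows; this assumes hmmlearn and scikit-learn and is only an illustration of the idea, not the cited implementation.

```python
# Initialize GaussianHMM state means with k-means before Baum-Welch (illustrative).
import numpy as np
from sklearn.cluster import KMeans
from hmmlearn.hmm import GaussianHMM

def kmeans_initialized_hmm(sequences, n_states=3, seed=0):
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    centers = KMeans(n_clusters=n_states, n_init=10, random_state=seed).fit(X).cluster_centers_
    # init_params excludes 'm' so the k-means means are kept as the starting point
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=50, init_params="stc")
    model.means_ = centers
    model.fit(X, lengths)
    return model
```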
clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:627 citee title:mining sequential patterns citee abstract:we are given a large database of customer transactions, where each transaction consists of customer-id, transaction time, and the items bought in the transaction. we introduce the problem of mining sequential patterns over such databases. we present three algorithms to solve this problem, and empirically evaluate their performance using synthetic data. two of the proposed algorithms, apriorisome and aprioriall, have comparable performance, albeit apriorisome performs a little better when the surrounding text:in building the learning model, class sequential rules automatically generated from the data are used as features. class sequential rules are different from traditional sequential patterns [***, 2, 25]<2> because a class label is attached, which results in a rule with a sequential pattern on the left-hand-side of the rule, and a class on the right-hand-side of the rule. in our context, the classes are comparative or non-comparative. a further advancement of our work is the use of multiple minimum supports in mining. existing sequential pattern mining techniques in data mining use only a single minimum support [***]<2> to control the pattern generation process so that not too many patterns are produced. the minimum support is simply the probability that a pattern appears in a sentence, which is estimated as the ratio of the number of sentences containing the pattern and the total number of sentences in the data. 4. 1 class sequential rules with multiple minimum supports sequential pattern mining (spm) is an important data mining task [***, 2, 25]<2>. given a set of input sequences, the spm task is to find all sequential patterns that satisfy a user-specified minimum support (or frequency) constraint influence:3 type:2 pair index:698 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. 
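since the surrounding text above defines support at the sentence level (the fraction of sentences that contain a pattern), here is a small illustrative sketch of that computation with per-pattern minimum supports; the function names and the way minimum supports are assigned are assumptions for illustration, not the authors' algorithm.

```python
# Support of a sequential pattern = fraction of sentences containing it as a subsequence.
# With multiple minimum supports, each pattern is kept only if its support reaches
# the minimum support assigned to that pattern (illustrative sketch).

def contains_subsequence(sentence, pattern):
    """sentence, pattern: lists of tokens (e.g., words or POS tags)."""
    it = iter(sentence)
    return all(tok in it for tok in pattern)

def support(pattern, sentences):
    hits = sum(contains_subsequence(s, pattern) for s in sentences)
    return hits / len(sentences)

def frequent_patterns(candidates, sentences, min_sups):
    """candidates: list of patterns; min_sups: dict tuple(pattern) -> its minimum support."""
    return [p for p in candidates
            if support(p, sentences) >= min_sups.get(tuple(p), 0.01)]

sentences = [["x", "is", "better", "than", "y"],
             ["the", "sound", "quality", "is", "poor"],
             ["x", "beats", "y", "easily"]]
print(support(["better", "than"], sentences))   # 1/3
```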
an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:628 citee title:sequential pattern mining using a bitmap representation citee abstract:we introduce a new algorithm for mining sequential patterns. our algorithm is especially efficient when the sequential patterns in the database are very long. we introduce a novel depth-first search strategy that integrates a depth-first traversal of the search space with effective pruning mechanisms. our implementation of the search strategy combines a vertical bitmap representation of the database with efficient support counting. a salient feature of our algorithm is that it incrementally outputs new frequent itemsets in an online fashion. in a thorough experimental evaluation of our algorithm on standard benchmark data from the literature, our algorithm outperforms previous work up to an order of magnitude. surrounding text:in building the learning model, class sequential rules automatically generated from the data are used as features. class sequential rules are different from traditional sequential patterns [1, ***, 25]<2> because a class label is attached, which results in a rule with a sequential pattern on the left-hand-side of the rule, and a class on the right-hand-side of the rule. in our context, the classes are comparative or non-comparative. 4. 1 class sequential rules with multiple minimum supports sequential pattern mining (spm) is an important data mining task [1, ***, 25]<2>. given a set of input sequences, the spm task is to find all sequential patterns that satisfy a user-specified minimum support (or frequency) constraint influence:3 type:2 pair index:699 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. 
their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:175 citee title:a simple rule-based part of speech tagger citee abstract:automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule- based methods. in this paper, we present a sim- ple rule-based part of speech tagger which automatically acquires its rules and tags with accuracy coinparable to stochastic taggers. the rule-based tagger has many advantages over these taggers, including: a vast reduction in stored information required, the perspicuity of a sinall set of meaningful rules surrounding text:since we need part-of-speech (pos) tags throughout this section and the paper, let us first acquaint ourselves with some tags and their pos categories. we used brill's tagger [***]<1> to tag sentences. it follows the penn tree bank [28] pos tagging scheme. all the results except for the first two were obtained through 5-fold cross validation. we discuss the results below: 1) pos tags of jjs, rbr, jjs and rbs: we used the brills tagger [***]<1>. if a sentence contains anyone of the above tags, it is classified as a comparative sentence influence:2 type:1 pair index:700 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. 
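the first baseline described above simply flags a sentence as comparative when it contains a comparative or superlative pos tag. a rough re-creation is sketched below using nltk's default penn treebank tagger as a stand-in for brill's tagger, so the tokenizer, tagger, and tag coverage are assumptions of this sketch rather than the paper's exact setup.

```python
# Baseline sketch: flag a sentence as comparative if it contains a comparative or
# superlative POS tag (JJR, JJS, RBR, RBS in the Penn Treebank tagset).
# Uses NLTK's default tagger as a stand-in for Brill's tagger.
import nltk  # may require: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

COMPARATIVE_TAGS = {"JJR", "JJS", "RBR", "RBS"}

def pos_tag_baseline(sentence):
    tokens = nltk.word_tokenize(sentence)
    tags = {tag for _, tag in nltk.pos_tag(tokens)}
    return bool(tags & COMPARATIVE_TAGS)

print(pos_tag_baseline("canon's optics are better than those of sony"))
print(pos_tag_baseline("the sound quality of cd player x is not as good as that of cd player y"))
```

as the second example suggests, "as ... as" comparatives carry no comparative tag, which is one reason a tag-only baseline has limited recall and motivates the rule- and learning-based approach.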
experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:[24]<2> examines several supervised machine learning methods for sentiment classification of movie reviews. [***]<2> also experiments a number of learning methods for review classification. they show that the classifiers perform well on whole reviews, but poorly on sentences because a sentence contains much less information influence:2 type:2 pair index:701 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. 
more detailed results are given in the paper citee id:630 citee title:xtag system-a wide coverage grammar for english citee abstract:this paper presents the xtag system, a grammar development tool based on the tree adjoining grammar (tag) formalism that includes a wide-coverage syntactic grammar for english. the various components of the system are discussed and preliminary evaluation results from the parsing of various corpora are given. results from the comparison of xtag against the ibm statistical parser and the alvey natural language tool parser are also given. 1 introduction xtag is a large on-going project to develop surrounding text:the semantic analysis is based on logic, which is not directly applicable to identifying comparative sentences. the types of comparatives (such as adjectival, adverbial, nominal, superlatives, etc) are described in [***]<0>. the focus of these researches is on a limited set of comparative constructs which have gradable keywords like more, less, etc influence:3 type:3 pair index:702 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:631 citee title:wordnet: an electronic lexical database citee abstract:wordnet is perhaps the most important and widely used lexical resource for natural language processing systems up to now. wordnet: an electronic lexical database, edited by christiane fellbaum, discusses the design of wordnet from both theoretical and historical perspectives, provides an up-to-date description of the lexical database, and presents a set of applications of wordnet. the book contains a foreword by george miller, an introduction by christiane fellbaum, seven chapters from the cognitive sciences laboratory of princeton university, where wordnet was produced, and nine chapters contributed by scientists from elsewhere. 
surrounding text:we first manually found a list of 30 words by going through a subset of comparative sentences. we then used wordnet [***]<1> to find their synonyms. after manual pruning, a final list of 69 words is produced influence:3 type:3 pair index:703 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:545 citee title:extracting knowledge from evaluative text citee abstract:capturing knowledge from free-form evaluative texts about an entity is a challenging task. new techniques of feature extraction, polarity determination and strength evaluation have been proposed. feature extraction is particularly important to the task as it provides the underpinnings of the extracted knowledge. the work in this paper introduces an improved method for feature extraction that draws on an existing unsupervised method. by including user-specific prior knowledge of the evaluated entity, we turn the task of feature extraction into one of term similarity by mapping crude (learned) features into a user-defined taxonomy of the entity's features. results show promise both in terms of the accuracy of the mapping as well as the reduction in the semantic redundancy of crude features. surrounding text:specifically, they identify product features that have been commented on by customers and determining whether the opinions are positive or negative. [26, ***]<2> improve the work in [11]<2>. however, none of these studies is on comparison, which is the focus of this work influence:3 type:2 pair index:704 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. 
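the keyword-expansion step described at the start of this line (seed words grown via wordnet synonyms and then manually pruned) can be sketched with nltk's wordnet interface; the seed list below is illustrative and is not the authors' 30-word list.

```python
# Expand a seed list of comparative keywords with WordNet synonyms (sketch).
# Requires NLTK with the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

def expand_with_synonyms(seed_words):
    expanded = set(seed_words)
    for word in seed_words:
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                expanded.add(lemma.name().replace("_", " ").lower())
    return sorted(expanded)

seeds = ["beat", "exceed", "outperform", "prefer", "superior", "versus"]  # illustrative seeds
print(expand_with_synonyms(seeds))  # manual pruning would follow, as described above
```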
sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for texl processing applicalions such as information exlractiou and information iclricwd. we study tile elfeels of dymunic adjectives, semantically oriented adjectives, and gradable ad.ieclivcs on a simple subjectivity classiiicr, and establish lhat lhcy arc strong predictors of subjectivity. a novel trainable mclhod thai statistically combines two indicators of gradability is presented and ewlhlalcd, complementing exisling automatic icchniques for assigning orientation labels. surrounding text:they show that the classifiers perform well on whole reviews, but poorly on sentences because a sentence contains much less information. [***]<2> investigates sentence subjectivity classification. a method is proposed to find adjectives that are indicative of positive or negative opinions. [32]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [***, 15, 16, 23, 27, 33, 34, 35]<2>. in [11, 19]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews influence:2 type:2 pair index:705 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. 
instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:417 citee title:direction-based text interpretation as an information access refinement citee abstract:a text-based intelligent system should provide more in-depth information about the contents of its corpus than does a standard information retrieval system, while at the same time avoiding the complexity and resource-consuming behavior of detailed text understanders. instead of focusing on discovering documents that pertain to some topic of interest to the user, an approach is introduced based on the criterion of directionality (e.g., is the agent in favor of, neutral, or opposed to the event?). a method is described for coercing sentence meanings into a metaphoric model such that the only semantic interpretation needed in order to determine the directionality of a sentence is done with respect to the model. this interpretation method is designed to be an integrated component of a hybrid information access system. surrounding text:sentiment classification classifies opinion texts or sentences as positive or negative. work of hearst [***]<2> on classification of entire documents uses models inspired by cognitive linguistics. das and chen [4]<2> use a manually crafted lexicon in conjunction with several scoring methods to classify stock postings influence:3 type:2 pair index:706 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. 
identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:other related works on sentiment classification and opinions discovery include [9, 15, 16, 23, 27, 33, 34, 35]<2>. in [***, 19]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews. specifically, they identify product features that have been commented on by customers and determining whether the opinions are positive or negative. specifically, they identify product features that have been commented on by customers and determining whether the opinions are positive or negative. [26, 8]<2> improve the work in [***]<2>. however, none of these studies is on comparison, which is the focus of this work influence:2 type:2 pair index:707 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. 
sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:109 citee title:mining comparative sentences and relations citee abstract:this paper studies a text mining problem, comparative sentence mining (csm). a comparative sentence expresses an ordering relation between two sets of entities with respect to some common features. for example, the comparative sentence canons optics are better than those of sony and nikon expresses the comparative relation: (better, , , ). given a set of evaluative texts on the web, e.g., reviews, forum postings, and news articles, the task of comparative sentence mining is (1) to identify comparative sentences from the texts and (2) to extract comparative relations from the identified comparative sentences. this problem has many applications. for example, a product manufacturer wants to know customer opinions of its products in comparison with those of its competitors. in this paper, we propose two novel techniques based on two new types of sequential rules to perform the tasks. experimental evaluation has been conducted using different types of evaluative texts from the web. results show that our techniques are very promising. surrounding text:, what are compared and which is better. this extraction problem is studied in [***]<0>. in summary, this paper makes three contributions: 1 influence:1 type:2 pair index:708 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. 
instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:632 citee title:making large-scale svm learning practical citee abstract:training a support vector machine (svm) leads to a quadratic optimization problem with bound constraints and one linear equality constraint. despite the fact that this type of problem is well understood, there are many issues to be considered in designing an svm learner. in particular, for large learning tasks with many training examples, off-the-shelf optimization techniques for general quadratic programs quickly become intractable in their memory and time requirements. svmlight is an surrounding text:ve bayesian model [20, 21]<1> to learn a classifier using the class sequential rules as features. for comparison purposes, we also experimented with support vector machines (svm) [31, ***]<1>, which is considered to be one of the strongest classifier building methods. we conducted empirical evaluation using three types of documents, news articles, consumer reviews of products, and internet forum discussions influence:3 type:2 pair index:709 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. 
it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:633 citee title:summarization and tracking in news and blog corpora citee abstract:humans like to express their opinions and are eager to know others opinions. automatically mining and organizing opinions from heterogeneous information sources are very useful for individuals, organizations and even governments. opinion extraction, opinion summarization and opinion tracking are three important techniques for understanding opinions. opinion extraction mines opinions at word, sentence and document levels from articles. opinion summarization summarizes opinions of articles by telling sentiment polarities, degree and the correlated events. in this paper, both news and web blog articles are investigated. trec, ntcir and articles collected from web blogs serve as the information sources for opinion extraction. documents related to the issue of animal cloning are selected as the experimental materials. algorithms for opinion extraction at word, sentence and document level are proposed. the issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. opinion summarizations are visualized by representative sentences. text-based summaries in different languages, and from different sources, are compared. finally, an opinionated curve showing supportive and nonsupportive degree along the timeline is illustrated by an opinion tracking system. surrounding text:[32]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [9, 15, ***, 23, 27, 33, 34, 35]<2>. in [11, 19]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews influence:2 type:2 pair index:710 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. 
it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly. surrounding text:other related works on sentiment classification and opinions discovery include [9, 15, 16, 23, 27, 33, 34, 35]<2>. in [11, ***]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews. specifically, they identify product features that have been commented on by customers and determining whether the opinions are positive or negative influence:2 type:2 pair index:711 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. 
identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:634 citee title:web data mining: exploring hyperlinks, contents, and usage data citee abstract:association rule mining is an important model in data mining. its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. minsup controls the minimum number of data cases that a rule must cover. minconf controls the predictive strength of the rule. since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. this is, however, seldom the case in real-life applications. in many applications, some items appear very frequently in the data, while others rarely appear. if minsup is set too high, those rules that involve rare items will not be found. to find rules that involve both frequent and rare items, minsup has to be set very low. this may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. this dilemma is called the rare item problem. this paper proposes a novel technique to solve this problem. the technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. in rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules. experiment results show that the technique is very effective. surrounding text:we then used the naïve bayesian model [***, 21]<1> to learn a classifier using the class sequential rules as features. for comparison purposes, we also experimented with support vector machines (svm) [31, 13]<1>, which is considered to be one of the strongest classifier building methods. the naïve bayesian classification model (nb) [***, 21]<1> provides a natural solution as it is able to combine multiple probabilities to arrive at a single probability decision. our experiment results show that the classifier built using this learning approach based on the class sequential rules performs much better influence:3 type:1 pair index:712 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products.
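the excerpt above notes that the citing paper trains a naïve bayesian classifier whose features are the mined class sequential rules, combining the per-feature probabilities into a single class decision. the following is a minimal illustrative sketch of that setup, not the authors' implementation: it assumes each sentence has already been reduced to a binary vector recording which (hypothetical) rule patterns it satisfies, and fits a bernoulli-style naïve bayes model with laplace smoothing.

    import math

    def train_nb(X, y, n_features, alpha=1.0):
        # bernoulli naive bayes with laplace smoothing.
        # X: list of binary feature vectors (one per sentence); y: class labels.
        classes = sorted(set(y))
        prior = {c: sum(1 for lab in y if lab == c) / len(y) for c in classes}
        counts = {c: [0] * n_features for c in classes}   # how often feature j fires in class c
        totals = {c: 0 for c in classes}
        for xv, lab in zip(X, y):
            totals[lab] += 1
            for j, v in enumerate(xv):
                if v:
                    counts[lab][j] += 1
        prob = {c: [(counts[c][j] + alpha) / (totals[c] + 2 * alpha)
                    for j in range(n_features)] for c in classes}
        return prior, prob

    def predict_nb(xv, prior, prob):
        # combine the per-feature probabilities into one log-score per class.
        best, best_score = None, float('-inf')
        for c in prior:
            score = math.log(prior[c])
            for j, v in enumerate(xv):
                p = prob[c][j]
                score += math.log(p if v else 1.0 - p)
            if score > best_score:
                best, best_score = c, score
        return best

    # toy usage: feature j means "the sentence matches hypothetical rule pattern j"
    X = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
    y = ['comparative', 'non-comparative', 'comparative', 'non-comparative']
    prior, prob = train_nb(X, y, n_features=3)
    print(predict_nb([1, 0, 0], prior, prob))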
comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:115 citee title:sentiment analysis: capturing favorability using natural language processing citee abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative.the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles. surrounding text:[32]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [9, 15, 16, ***, 27, 33, 34, 35]<2>. in [11, 19]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews influence:2 type:2 pair index:713 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. 
identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:[30]<2> applies a unsupervised learning technique based on mutual information between document phrases and the words excellent and poor to find indicative words of opinions for classification. [***]<2> examines several supervised machine learning methods for sentiment classification of movie reviews. [5]<2> also experiments a number of learning methods for review classification influence:2 type:2 pair index:714 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. 
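the excerpt above mentions turney's unsupervised technique, which scores a phrase by its mutual information with the reference words excellent and poor. below is a minimal sketch of that idea under simplifying assumptions: it works from raw co-occurrence counts supplied by the caller rather than from search-engine hit counts, and the numbers in the example are invented for illustration.

    import math

    def pmi(count_xy, count_x, count_y, total):
        # pointwise mutual information from raw counts; -inf if anything is unseen.
        if count_xy == 0 or count_x == 0 or count_y == 0:
            return float('-inf')
        return math.log2(count_xy * total / (count_x * count_y))

    def semantic_orientation(c, total):
        # SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
        return (pmi(c['with_excellent'], c['phrase'], c['excellent'], total)
                - pmi(c['with_poor'], c['phrase'], c['poor'], total))

    # invented counts for the phrase "very quiet" in a hypothetical review corpus
    counts = {'phrase': 120, 'excellent': 900, 'poor': 700,
              'with_excellent': 40, 'with_poor': 5}
    so = semantic_orientation(counts, total=1_000_000)
    print('positive' if so > 0 else 'negative', round(so, 2))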
experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:635 citee title:mining sequential patterns by pattern-growth: the prefixspan approach citee abstract:sequential pattern mining is an important data mining problem with broad applications. however, it is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. most of the previously developed sequential pattern mining methods, such as gsp, explore a candidate generation-and-test approach to reduce the number of candidates to be examined. however, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. in this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. in this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. based on an initial study of the pattern growth-based sequential pattern mining, freespan , we propose a more efficient method, called psp, which offers ordered growth and reduced projected databases. to further improve the performance, a pseudoprojection technique is developed in prefixspan. a comprehensive performance study shows that prefixspan, in most cases, outperforms the a priori-based algorithm gsp, freespan, and spade (a sequential pattern mining algorithm that adopts vertical data format), and prefixspan integrated with pseudoprojection is the fastest among all the tested algorithms. furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. the high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures. surrounding text:in building the learning model, class sequential rules automatically generated from the data are used as features. class sequential rules are different from traditional sequential patterns [1, 2, ***]<2> because a class label is attached, which results in a rule with a sequential pattern on the left-hand-side of the rule, and a class on the right-hand-side of the rule. in our context, the classes are comparative or non-comparative. 4. 1 class sequential rules with multiple minimum supports sequential pattern mining (spm) is an important data mining task [1, 2, ***]<2>. given a set of input sequences, the spm task is to find all sequential patterns that satisfy a user-specified minimum support (or frequency) constraint influence:3 type:2 pair index:715 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. 
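the excerpt above defines class sequential rules: a sequential pattern on the left-hand side and a class (comparative or non-comparative) on the right, evaluated against minimum support and confidence. the sketch below is illustrative only: it treats a sentence as a sequence of itemsets, tests pattern containment, and computes one candidate rule's support and confidence; the pattern-generation search (e.g. a prefixspan-style enumeration) and the multiple-minimum-support refinement are left out.

    def contains(sequence, pattern):
        # true if 'pattern' (a list of itemsets) occurs in order inside 'sequence'.
        i = 0
        for itemset in sequence:
            if i < len(pattern) and pattern[i] <= itemset:
                i += 1
        return i == len(pattern)

    def rule_stats(data, pattern, target_class):
        # support and confidence of the class sequential rule: pattern -> target_class.
        # data: list of (sequence, class_label) pairs.
        covered = [label for seq, label in data if contains(seq, pattern)]
        if not data or not covered:
            return 0.0, 0.0
        hits = sum(1 for lab in covered if lab == target_class)
        return hits / len(data), hits / len(covered)

    # toy sentences encoded as sequences of itemsets (words plus hypothetical POS tags)
    data = [
        ([{'x', 'NN'}, {'better', 'JJR'}, {'than', 'IN'}, {'y', 'NN'}], 'comparative'),
        ([{'x', 'NN'}, {'is', 'VBZ'}, {'great', 'JJ'}],                 'non-comparative'),
        ([{'y', 'NN'}, {'beats', 'VBZ'}, {'x', 'NN'}],                  'comparative'),
    ]
    pattern = [{'JJR'}, {'than', 'IN'}]
    print(rule_stats(data, pattern, 'comparative'))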
furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:119 citee title:extracting product features and opinions from reviews citee abstract:consumers are often forced to wade through many on-line reviews in order to make an informed product choice. this paper introduces opine, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products.compared to previous work, opine achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. opine's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity. surrounding text:specifically, they identify product features that have been commented on by customers and determining whether the opinions are positive or negative. [***, 8]<2> improve the work in [11]<2>. however, none of these studies is on comparison, which is the focus of this work influence:2 type:2 pair index:716 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. 
it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:120 citee title:learning extraction patterns for subjective expressions citee abstract:this paper presents a bootstrapping process that learns linguistically rich extraction patterns for subjective (opinionated) expressions. high-precision classifiers label unannotated data to automatically create a large training set, which is then given to an extraction pattern learning algorithm. the learned patterns are then used to identify more subjective sentences. the bootstrapping process learns many subjective patterns and increases recall while maintaining high precision. surrounding text:[32]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [9, 15, 16, 23, ***, 33, 34, 35]<2>. in [11, 19]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews influence:3 type:2 pair index:717 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:123 citee title:towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences citee abstract:opinion question answering is a challenging task for natural language processing. in this paper, we discuss a necessary component for an opinion question answering system: separating opinions from fact, at both the document and sentence level. 
we present a bayesian classifier for discriminating between documents with a preponderance of opinions such as editorials from regular news stories, and describe three unsupervised, statistical techniques for the significantly harder task of detecting opinions at the sentence level. we also present a first model for classifying opinion sentences as positive or negative in terms of the main perspective being expressed in the opinion. results from a large collection of news stories and a human evaluation of 400 sentences are reported, indicating that we achieve very high performance in document classification (upwards of 97% precision and recall), and respectable performance in detecting opinions and classifying them at the sentence level as positive, negative, or neutral (up to 91% accuracy). surrounding text:[32]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [9, 15, 16, 23, 27, 33, ***, 35]<2>. in [11, 19]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews influence:2 type:2 pair index:718 citer id:626 citer title:identifying comparative sentences in text documents citer abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper citee id:29 citee title:a cross-collection mixture model for comparative text mining citee abstract:problem, which we refer to as comparative text mining (ctm). given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. this general problem subsumes many interesting applications, including business intelligence and opinion summarization.
we propose a generative probabilistic mixture model for comparative text surrounding text:[32]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [9, 15, 16, 23, 27, 33, 34, ***]<2>. in [11, 19]<2>, several unsupervised and supervised techniques are proposed to analyze opinions in customer reviews influence:2 type:2 pair index:719 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:646 citee title:the rise of user-generated content citee abstract:web 2.0 is dominated by the consumer or end user. with only a browser and an internet connection, anyone can publish content. they have the power to speak out, whether it's reviewing a product or service or developing their own brands with less capital than was required before. surrounding text:this novel term has been around for a while, but has only really taken off in the past few years with the revolution of web 2. 0, transforming from discussion boards to the entire websites through social networking, blogging and video sharing ([***]<3>). the flourish of ugc is especially impressive in china influence:2 type:3 pair index:720 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. 
we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:113 citee title:opinion extraction, summarization and tracking in news and blog corpora citee abstract:humans like to express their opinions and are eager to know others opinions. automatically mining and organizing opinions from heterogeneous information sources are very useful for individuals, organizations and even governments. opinion extraction, opinion summarization and opinion tracking are three important techniques for understanding opinions. opinion extraction mines opinions at word, sentence and document levels from articles. opinion summarization summarizes opinions of articles by telling sentiment polarities, degree and the correlated events. in this paper, both news and web blog articles are investigated. trec, ntcir and articles collected from web blogs serve as the information sources for opinion extraction. documents related to the issue of animal cloning are selected as the experimental materials. algorithms for opinion extraction at word, sentence and document level are proposed. the issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. opinion summarizations are visualized by representative sentences. text-based summaries in different languages, and from different sources, are compared. finally, an opinionated curve showing supportive and nonsupportive degree along the timeline is illustrated by an opinion tracking system. surrounding text:in addition, our work focuses on the context of chinese in which word segmentation and syntactic parsing are issues which are both different from those in english. part of the work on sentiment classification for sentences/documents takes the postulation that sentiment orientation of the whole is a function of that of the parts ([***]<2>). in other words, sentiment orientation of sentences/documents can be inferred from that of words contained in these sentences/documents. in our paper, our own sentiment dictionary is built upon that of [***]<1>, and because of the chinese segmentation issue (2) we refine their dictionary as our own containing 1262 positive words and 2930 negative ones. to define rules to extract topics of opinions in sentences, we examine syntactic role that sentiment word plays in a sentence within a large corpus of opinion sentences, and then select manually the corresponding topic, along with its syntactic role and its relationship with the sentiment word (footnote 2: different segmentation algorithms generate different segmentation results given a same character sequence, therefore the algorithm employed in our work fails to get some words in the dictionary of [***]<2>) influence:1 type:2 pair index:721 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china.
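the excerpt above says the citing work looks at the syntactic role a sentiment word plays in a sentence and then picks the related topic through that syntactic link. as a rough illustration of this kind of rule (not the authors' actual rules or parser), the sketch below assumes a dependency parse is already available as (head, relation, dependent) triples and walks from a sentiment word to a noun connected to it by a subject or modifier relation; the relation labels and the toy parse are invented.

    def find_topic(triples, sentiment_word, nouns):
        # return a noun syntactically related to the sentiment word, if any.
        # triples: (head, relation, dependent) edges from a dependency parse.
        # the relation labels used here ('subj', 'mod') are illustrative assumptions.
        for head, rel, dep in triples:
            # sentiment word predicates a noun: "the picture is sharp" -> (sharp, subj, picture)
            if head == sentiment_word and rel == 'subj' and dep in nouns:
                return dep
            # sentiment word modifies a noun: "sharp picture" -> (picture, mod, sharp)
            if dep == sentiment_word and rel == 'mod' and head in nouns:
                return head
        return None

    # hand-built toy parse for: "the picture of this phone is sharp"
    triples = [('sharp', 'subj', 'picture'), ('picture', 'of', 'phone')]
    nouns = {'picture', 'phone'}
    print(find_topic(triples, 'sharp', nouns))   # -> picture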
the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:111 citee title:determining the sentiment of opinions citee abstract:identifying sentiments (the affective parts of opinions) is a challenging problem. we present a system that, given a topic, automatically finds the people who hold opinions about that topic and the sentiment of each opinion. the system contains a module for determining word sentiment and another for combining sentiments within a sentence. we experiment with various models of classifying and combining sentiment at word and sentence levels, with promising results. surrounding text:since its first investigation in the year of 1997 ([5]<2>), a lot of work has already been done on opinion mining ([6], [7], [***], [9]<2>). according to the definition in [***]<2>, opinion is described as a quadruple including topic, holder, claim and sentiment that the holder believes a claim about the topic and in many cases associates a sentiment with the belief. take the sentence tom said the movie was nice for example, tom and movie are the holder and topic of the opinion respectively. sentences with the average log-likelihood scores exceeding the pre-set threshold are classified as positive, sentences with scores lower than the threshold are classified as negative and those with in-between scores are treated as neutral. in the sentiment classification module of the work in [***]<2>, three models are proposed. all these models do the classification based on the word sentiment orientation influence:1 type:2 pair index:722 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. 
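the excerpt above describes classifying a sentence by the average log-likelihood-style score of its words: above a preset threshold positive, below a lower threshold negative, and in between neutral. the following is a schematic version with assumed word scores and thresholds; the per-word scores (the harder part) are taken as given and the numbers are invented.

    def classify_sentence(words, word_score, pos_threshold=0.3, neg_threshold=-0.3):
        # average the per-word orientation scores and map the mean onto three classes.
        # word_score: dict of precomputed word sentiment scores (assumed given).
        scores = [word_score[w] for w in words if w in word_score]
        if not scores:
            return 'neutral'
        avg = sum(scores) / len(scores)
        if avg > pos_threshold:
            return 'positive'
        if avg < neg_threshold:
            return 'negative'
        return 'neutral'

    # invented word scores standing in for corpus-estimated log-likelihood ratios
    word_score = {'nice': 0.9, 'poor': -1.1, 'movie': 0.0, 'sound': 0.0}
    print(classify_sentence(['the', 'movie', 'was', 'nice'], word_score))   # -> positive
    print(classify_sentence(['the', 'sound', 'is', 'poor'], word_score))    # -> negative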
we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:214 citee title:towards a robust metric of opinion citee abstract:this paper describes an automated system for detecting polar expressions about a topic of interest. the two elementary components of this approach are a shallow nlp polar language extraction system and a machine learning based topic classifier. these components are composed together by making a simple but accurate collocation assumption: if a topical sentence contains polar language, the system predicts that the polar language is reflective of the topic, and not some other subject matter. we evaluate our system, components and assumption on a corpus of online consumer messages. based on these components, we discuss how to measure the overall sentiment about a particular topic as expressed in online messages authored by many different people. we propose to use the fundamentals of bayesian statistics to form an aggregate authorial opinion metric. this metric would propagate uncertainties introduced by the polarity and topic modules to facilitate statistically valid comparisons of opinion across multiple topics. surrounding text:other applications stemming from this mining technique include market intelligence, advertisement placement, opinion search and etc. since its first investigation in the year of 1997 ([5]<2>), a lot of work has already been done on opinion mining ([6], [7], [8], [***]<2>). according to the definition in [8]<2>, opinion is described as a quadruple including topic, holder, claim and sentiment that the holder believes a claim about the topic and in many cases associates a sentiment with the belief influence:2 type:2 pair index:723 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. 
we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:647 citee title:opinion holder extraction from author and authority viewpoints citee abstract:opinion holder extraction research is important for discriminating between opinions that are viewed from different perspectives. in this paper, we describe our experience of participation in the ntcir-6 opinion analysis pilot task by focusing on opinion holder extraction results in japanese and english. our approach to opinion holder extraction was based on the discrimination between author and authority viewpoints in opinionated sentences, and the evaluation results were fair with respect to the japanese documents surrounding text:sometimes, the holder just expresses a judgment on an object without any underlying sentiments, such as the sentence jerry believes the movie will be shown on friday. research has been conducted on the different components of opinions as presented above, such as holder identification ([***]<2>), topic/target extraction ([11], [12]<2>) and sentiment classification ([13], [15], [16], [17], [18], [19], [20]<2>). in our work, we focus on the latter two subtasks which most current research on opinion mining dedicates to influence:2 type:2 pair index:724 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. 
for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:related work current work on opinion mining mainly focuses on topic related issues and sentiment classification. in following sections, we first describe the topic related work by mingqing hu ([***]<2>) and then show the previous work on sentiment classification at word and sentence/document levels. in [***]<2>, hu et al$ focus on the customer reviews of products and propose an approach to mining and summarizing these reviews. in following sections, we first describe the topic related work by mingqing hu ([***]<2>) and then show the previous work on sentiment classification at word and sentence/document levels. in [***]<2>, hu et al$ focus on the customer reviews of products and propose an approach to mining and summarizing these reviews. their summarization task is performed in three steps: mining product features that have been commented by customers. 4. 1 examination on the quality of super set to examine the performance of our approach to building the super set for a topic, we compare our approach with the association rules approach (using the apriori algorithm) employed in [***]<2>. in addition, to compare the valuable information contained in ugc with that in traditional online content (which we call authority generated content (agc)), we conduct experiments on two different kinds of data sets: data collected from blogs and those from news reports as the representatives for ugc and agc respectively influence:1 type:2 pair index:725 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. 
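the abstract and excerpt above refer to mining candidate product features or topics as frequent terms across review sentences, with the cited work using apriori-style association rule mining to build the candidate set. the code below is a much-reduced sketch of that idea: it counts candidate noun terms (supplied by the caller) per sentence and keeps the combinations whose support clears a minimum threshold; a real system would run full apriori over extracted noun phrases and prune the result.

    from collections import Counter
    from itertools import combinations

    def frequent_features(sentences, min_support=0.02, max_size=2):
        # return itemsets of candidate terms appearing in at least min_support of sentences.
        # sentences: list of sets of candidate (noun) terms, one set per review sentence.
        n = len(sentences)
        counts = Counter()
        for terms in sentences:
            for size in range(1, max_size + 1):
                for combo in combinations(sorted(terms), size):
                    counts[combo] += 1
        return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

    # toy candidate-noun sets extracted from hypothetical camera review sentences
    sentences = [{'picture', 'quality'}, {'battery'}, {'picture'},
                 {'battery', 'life'}, {'picture', 'quality'}]
    print(frequent_features(sentences, min_support=0.4))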
we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:648 citee title:topic sentiment mixture: modeling facets and opinions in weblogs citee abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction. surrounding text:sometimes, the holder just expresses a judgment on an object without any underlying sentiments, such as the sentence jerry believes the movie will be shown on friday. research has been conducted on the different components of opinions as presented above, such as holder identification ([10]<2>), topic/target extraction ([11], [***]<2>) and sentiment classification ([13], [15], [16], [17], [18], [19], [20]<2>). in our work, we focus on the latter two subtasks which most current research on opinion mining dedicates to influence:1 type:2 pair index:726 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. 
we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:649 citee title:learning subjective adjective from corpora citee abstract:subjectivity tagging is distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. there are numerous applications for which subjectivity tagging is relevant, including information extraction and information retrieval. this paper identifies strong clues of subjectivity using the results of a method for clustering words according to distributional similarity (lin 1998), seeded by a small amount of detailed manual annotation surrounding text:the underlying intuition is that the act of conjoining adjectives subjects to linguistic constraints on the orientation of the adjective involved, that is, and usually conjoins two adjectives of the same orientation while but conjoins two adjectives of opposite orientations. another study on word orientation determination also taking advantage of the linguistic knowledge is [***]<2>. in this paper, the author focuses on the problem of subjectivity tagging which distinguishes sentences used to present opinions and other forms of subjectivity from sentences used to objectively present factual information influence:2 type:2 pair index:727 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:270 citee title:automatic retrieval and clustering of similar words citee abstract:bootstrapping semantics from text is one of the greatest challenges in natural language learning. earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words. we present an approach that goes a step further by obtaining a tree structure among the most similar words so that different senses of a given word can be identified with different subtrees. submission type: paper topic surrounding text:in order to solve this problem, she proposes an approach to finding good linguistic clues -- the subjective adjectives, from a large corpus. 
she identifies high quality adjectives using the results of a method for clustering words according to distributional similarity ([***]<3>), seeded by a small amount of simple adjectives extracted from a detailed manually annotated corpus. turney et al influence:3 type:3 pair index:728 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:650 citee title:measuring praise and criticism: inference of semantic orientation from association citee abstract:the evaluative character of a word is called its semantic orientation. positive semantic orientation indicates praise (e.g., "honest", "intrepid") and negative semantic orientation indicates criticism (e.g., "disturbing", "superfluous"). semantic orientation varies in both direction (positive or negative) and degree (mild to strong). an automated system for measuring semantic orientation would have application in text classification, text filtering, tracking opinions in online discussions, analysis of survey responses, and automated chat systems (chatbots). this paper introduces a method for inferring the semantic orientation of a word from its statistical association with a set of positive and negative paradigm words. two instances of this approach are evaluated, based on two different statistical measures of word association: pointwise mutual information (pmi) and latent semantic analysis (lsa). the method is experimentally tested with 3,596 words (including adjectives, adverbs, nouns, and verbs) that have been manually labeled positive (1,614 words) and negative (1,982 words). the method attains an accuracy of 82.8% on the full test set, but the accuracy rises above 95% when the algorithm is allowed to abstain from classifying mild words. surrounding text:turney et al. in [***]<2> adopt a different methodology which requires little linguistic knowledge. they first define two minimal sets of seed terms as descriptive of the categories positive sp and negative sn influence:2 type:2 pair index:729 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. 
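the seed-based orientation scoring summarized in the record above (pmi between a word and two small seed sets) can be sketched as follows; the seed lists, the smoothing constant and the count dictionaries are illustrative placeholders for statistics that would normally be collected from a large corpus or a search engine.

```python
import math

# illustrative seed sets, in the spirit of the cited approach
POS_SEEDS = ["good", "nice", "excellent", "positive", "fortunate", "correct", "superior"]
NEG_SEEDS = ["bad", "nasty", "poor", "negative", "unfortunate", "wrong", "inferior"]

def so_pmi(word, cooc, count, total, pos_seeds=POS_SEEDS, neg_seeds=NEG_SEEDS):
    """Semantic orientation as the summed PMI with positive seeds minus the
    summed PMI with negative seeds.  cooc[(a, b)] holds near-co-occurrence
    counts, count[w] single-word counts, total the corpus size; a small
    constant avoids log(0)."""
    def pmi(a, b):
        p_ab = (cooc.get((a, b), 0) + 0.01) / total
        p_a = (count.get(a, 0) + 0.01) / total
        p_b = (count.get(b, 0) + 0.01) / total
        return math.log2(p_ab / (p_a * p_b))
    return (sum(pmi(word, s) for s in pos_seeds)
            - sum(pmi(word, s) for s in neg_seeds))
```

a positive result suggests praise and a negative one criticism, with the magnitude indicating strength, mirroring the cited abstract.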
user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:123 citee title:towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences citee abstract:opinion question answering is a challenging task for natural language processing. in this paper, we discuss a necessary component for an opinion question answering system: separating opinions from fact, at both the document and sentence level. we present a bayesian classifier for discriminating between documents with a preponderance of opinions such as editorials from regular news stories, and describe three unsupervised, statistical techniques for the significantly harder task of detecting opinions at the sentence level. we also present a first model for classifying opinion sentences as positive or negative in terms of the main perspective being expressed in the opinion. results from a large collection of news stories and a human evaluation of 400 sentences are reported, indicating that we achieve very high performance in document classification (upwards of 97% precision and recall), and respectable performance in detecting opinions and classifying them at the sentence level as positive, negative, or neutral (up to 91% accuracy). surrounding text:in other words, sentiment orientation of sentences/documents can be inferred from that of words contained in these sentences/documents. authors in [***]<2> first separate subjective sentences from factual sentences and then calculate the average per word log-likelihood scores of these sentences from the polarity values of words. sentences with the average log-likelihood scores exceeding the pre-set threshold are classified as positive, sentences with scores lower than the threshold are classified as negative and those with in-between scores are treated as neutral influence:2 type:2 pair index:730 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. 
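a minimal sketch of the sentence-level decision rule described in the surrounding text above: average the per-word polarity scores of a sentence and compare the result against two thresholds. the word-score lexicon and the threshold values here are made up for illustration.

```python
def classify_sentence(words, word_loglik, pos_thresh=0.1, neg_thresh=-0.1):
    """Average the per-word polarity log-likelihood scores of the words we
    have scores for, then compare against two thresholds: above the upper
    threshold -> positive, below the lower -> negative, otherwise neutral."""
    scores = [word_loglik[w] for w in words if w in word_loglik]
    if not scores:
        return "neutral"
    avg = sum(scores) / len(scores)
    if avg > pos_thresh:
        return "positive"
    if avg < neg_thresh:
        return "negative"
    return "neutral"

# toy usage with an assumed lexicon of per-word log-likelihood ratios
lexicon = {"great": 1.2, "awful": -1.5, "plot": 0.0}
print(classify_sentence("the plot is great".split(), lexicon))   # -> positive
```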
however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:other work on sentence sentiment classification utilizes machine learning methods. [***]<2> takes advantage of three classical machine learning approaches to do the classification: naive bayes, maximum entropy and support vector machines. they conduct experiments on movie reviews and the results show that machine learning methods outperform previous methods based on word sentiment orientation influence:2 type:2 pair index:731 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. 
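as one concrete instance of the machine-learning route mentioned above, the sketch below trains a multinomial naive bayes classifier on bag-of-words features using scikit-learn (an assumed library choice); the tiny labelled set stands in for a real movie-review corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up training snippets standing in for labelled movie reviews
docs = ["a wonderful, moving film", "brilliant acting and a great story",
        "a dull, boring waste of time", "terrible plot and awful dialogue"]
labels = ["positive", "positive", "negative", "negative"]

# unigram bag-of-words features fed into multinomial naive bayes
model = make_pipeline(CountVectorizer(ngram_range=(1, 1)), MultinomialNB())
model.fit(docs, labels)

print(model.predict(["a great, moving story"]))   # expected: ['positive']
```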
we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:651 citee title:the utility of linguistic rules in opinion mining citee abstract:online product reviews are one of the important opinion sources on the web. this paper studies the problem of determining the semantic orientations (positive or negative) of opinions expressed on product features in reviews. most existing approaches use a set of opinion words for the purpose. however, the semantic orientations of many words are context dependent. in this paper, we propose to use some linguistic rules to deal with the problem together with a new opinion aggregation function. extensive experiments show that these rules and the function are highly effective. a system, called opinion observer, has also been built surrounding text:sometimes, the holder just expresses a judgment on an object without any underlying sentiments, such as the sentence jerry believes the movie will be shown on friday. research has been conducted on the different components of opinions as presented above, such as holder identification ([10]<2>), topic/target extraction ([11], [12]<2>) and sentiment classification ([13], [15], [16], [17], [18], [19], [***]<2>). in our work, we focus on the latter two subtasks which most current research on opinion mining dedicates to. . there is still another situation as mentioned in [***]<2> that the sentiment word alone carries both polarities, such as the word big, long. however, we would rather regard these words merely as feature-related description words which require domain knowledge to determine the polarity or some other methods as proposed in [***]<2>. there is still another situation as mentioned in [***]<2> that the sentiment word alone carries both polarities, such as the word big, long. however, we would rather regard these words merely as feature-related description words which require domain knowledge to determine the polarity or some other methods as proposed in [***]<2>. in our paper, we focus on sentiment words of the first situation which carry determined polarities influence:1 type:2 pair index:732 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. 
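opinion words found near a feature usually have to be aggregated into a single orientation for that feature; the sketch below uses a distance-weighted sum of word orientations, which is in the spirit of the opinion-aggregation function mentioned in the cited work, though the exact function used there may differ, and the lexicon is illustrative.

```python
def feature_orientation(tokens, feature_idx, orientation):
    """Aggregate the orientations (+1/-1) of opinion words in a sentence,
    weighting each by the inverse of its token distance to the feature,
    and return the sign of the sum (0 means undecided)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok in orientation and i != feature_idx:
            total += orientation[tok] / abs(i - feature_idx)
    return (total > 0) - (total < 0)

# toy usage: 'battery' is the feature at index 1, lexicon is illustrative
lex = {"long": +1, "poor": -1, "great": +1}
print(feature_orientation("the battery lasts long despite the poor screen".split(), 1, lex))
```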
we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:546 citee title:extracting opinion topics for chinese opinions using dependence grammar citee abstract:previous work on opinion/sentiment mining focuses only on sentiment classification with the postulation that topics are identified a prior. however, this assumption often fails in reality. in advertising, topics on which users are commenting are crucial as corresponding advertisements can only be promoted when advertisers have the idea of what users are referring to. in this paper, we propose a rule-based approach to extracting topics from opinion sentences given these sentences identified from texts in advance. we build up a sentiment dictionary and define several rules based on the syntactic roles of words using the dependence grammar which is considered to be more suitable for chinese natural language parsing. the experiments show encouraging results surrounding text:in the subtask of topic extraction, we propose an approach based on our previous work ([***]<1>). in our approach to topic extraction, we first extract topics from opinion sentences through syntactic parsing of sentences using the dependency grammar. 3. 2 extract exact topics of opinions in our approach to extracting the topics of opinions, we first identify the initial topics (which are actually the subtopics of a topic) in opinion sentences based on seven rules defined in [***]<1>. then we construct a super set consisting of subtopics for a topic and filter out the noisy ones in initial topics influence:1 type:1 pair index:733 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:221 citee title:an efficient syntactic tagging tool for corpora citee abstract:the tree bank is an important resources for mt and linguistics researches, but it requires that large number of sentences be annotated with syntactic information. it is time consuming and troublesome, and difficult to keep consistency, if annotation is done manually. in this paper, we presented a new technique for the semi-automatic tagging of chinese text.
the system takes as input chinese text, and outputs the syntactically tagged sentence (dependency tree). we use dependency grammar and employ a stack based shift/reduce context-dependent parser as the tagging mechanism. the system works in human-machine cooperative way, in which the machine can acquire tagging rules from human intervention. the automation level can be improved step by step by accumulating rules during annotation. in addition, good consistency of tagging is guaranteed surrounding text:1 dependency grammar substantial efforts have been made on syntactic parsing of natural languages, and many sophisticated parsing grammars have been proposed to describe different aspects of linguistic characteristic, such as phrase structure grammar ([23]<3>), link grammar ([24]<3>) and dependency grammar ([25]<3>). it is believed that dependency grammar is more suitable for chinese natural language processing ([***]<3>). therefore, we employ the dependency grammar in our work influence:3 type:3 pair index:734 citer id:645 citer title:incorporate the syntactic knowledge in opinion mining in citer abstract:with the development of the accessibly to the internet, the content of the web is now being changed. user-generated content (ugc), such a kind of novel media content produced by end-users, has taken off in past few years with the revolution of web 2.0 and its flourish is especially impressive in china. the adoption of ugc has been proven to be beneficial to numbers of traditional tasks. however, the dramatic increase in the volume of such data prevents users from utilizing in a manual way and thus automatic mining approaches are demanded. opinion mining, a recent data mining technique at the crossroad of information retrieval and computational linguistics, is pretty suitable for this kind of information processing. in our paper, we dedicate our work to the main two subtasks of opinion mining: topic extraction and sentiment classification. we propose approaches to these two issues respectively for chinese based on the consideration of syntactic knowledge. we take the blog data, which is a typical application of ugc, as the evaluating data in our experiments and the results show that our approaches to the two tasks are promising. we also give an introduction to our future plans stemmed from the work done in this paper: an intelligent advertisement placement system in ugc citee id:652 citee title:parsing english with a link grammar citee abstract:we develop a formal grammatical system called a link grammar, show how english grammar can be encoded in such a system, and give algorithms for efficiently parsing with a link grammar. although the expressive power of link grammars is equivalent to that of context free grammars, encoding natural language grammars appears to be much easier with the new system. we have written a program for general link parsing and written a link grammar for the english language. the performance of this surrounding text:3. 1 dependency grammar substantial efforts have been made on syntactic parsing of natural languages, and many sophisticated parsing grammars have been proposed to describe different aspects of linguistic characteristic, such as phrase structure grammar ([23]<3>), link grammar ([***]<3>) and dependency grammar ([25]<3>).
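to make the dependency-based extraction discussed in these records concrete, the sketch below applies a single illustrative rule over (head, relation, dependent) triples as a dependency parser would emit them; it is not one of the seven rules of the cited work, and the relation names and opinion lexicon are assumptions.

```python
# each parsed sentence is a list of (head, relation, dependent) triples, as a
# dependency parser would produce; relation names here are illustrative
OPINION_WORDS = {"beautiful", "slow", "reliable"}

def extract_topics(triples):
    """One illustrative rule: if an opinion word is linked to another word
    through a subject or adjectival-modifier relation, report the other
    word as the opinion topic."""
    topics = []
    for head, rel, dep in triples:
        if rel in ("nsubj", "amod"):
            if dep in OPINION_WORDS:
                topics.append(head)
            elif head in OPINION_WORDS:
                topics.append(dep)
    return topics

# "the screen is beautiful" -> beautiful --nsubj--> screen
print(extract_topics([("beautiful", "nsubj", "screen"), ("screen", "det", "the")]))
```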
it is believed that dependency grammar is more suitable for chinese natural language processing ([22]<3>) influence:3 type:3 pair index:735 citer id:659 citer title:towards better integration of semantic predictors in statistical language modeling citer abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model citee id:147 citee title:a novel word clustering algorithm based on latent semantic analysis citee abstract:a new approach is proposed for the clustering of words in a given vocabulary. the method is based on a paradigm first formulated in the context of information retrieval, called latent semantic analysis. this paradigm leads to a parsimonious vector representation of each word in a suitable vector space, where familiar clustering techniques can be applied. the distance measure selected in this space arises naturally from the problem formulation. preliminary experiments indicate that, the clusters produced are intuitively satisfactory. because these clusters are semantic in nature, this approach may prove useful as a complement to conventional class-based statistical language modeling techniques surrounding text:[8]<2>, [6]<2>). in previous work, we [***]<2> and others [2]<2>, [5]<2> have suggested the use of latent semantic analysis (lsa) [3]<1> as a model of semantic knowledge to be applied to asr. lsa is a model of word semantic similarity based on word co-occurrence tendencies, and has been successful in ir and nlp applications, such as spelling correction [7]<3> influence:3 type:2 pair index:736 citer id:659 citer title:towards better integration of semantic predictors in statistical language modeling citer abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model citee id:655 citee title:indexing by latent semantic analysis citee abstract:a new method for automatic indexing and retrieval is described. the approach is to take advantage of implicit higher-order structure in the association of terms with documents (semantic structure) in order to improve the detection of relevant documents on the basis of terms found in queries. 
the particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. documents are represented by ca. 100 item vectors of factor weights. queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. initial tests find this completely automatic method for retrieval to be promising. surrounding text:[8]<2>, [6]<2>). in previous work, we [1]<2> and others [2]<2>, [5]<2> have suggested the use of latent semantic analysis (lsa) [3]<1> as a model of semantic knowledge to be applied to asr. lsa is a model of word semantic similarity based on word co-occurrence tendencies, and has been successful in ir and nlp applications, such as spelling correction [7]<3> influence:1 type:1 pair index:737 citer id:659 citer title:towards better integration of semantic predictors in statistical language modeling citer abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model citee id:644 citee title:improving the retrieval of information from external sources citee abstract:a major barrier to successful retrieval from external sources (e.g., electronic databases) is the tremendous variability in the words that people use to describe objects of interest. the fact that different authors use different words to describe essentially the same idea means that relevant objects will be missed; conversely, the fact that the same word can be used to refer to many different things means that irrelevant objects will be retrieved. we describe a statistical method called latent semantic indexing, which models the implicit higher order structure in the association of words and objects and improves retrieval performance by up to 30%. additional large performance improvements of 40% and 67% can be achieved through the use of differential term weighting and iterative retrieval methods. surrounding text:we introduce a confidence metric associated with each word that helps determine to what degree the lsa model is effective at predicting that word. our confidence metric is a global term weighting found to be useful in ir applications: the entropy of the frequency of a word over all documents in the training corpus [***]<3>.
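the svd construction described in the lsi abstract above can be sketched with numpy: factor the term-by-document matrix, keep the top factors, fold a query in as a pseudo-document and rank documents by cosine similarity. the toy matrix and the choice of 2 factors (instead of the ca. 100 used in the cited work) are for illustration only.

```python
import numpy as np

def lsi(term_doc, k=2):
    """Truncated SVD of a term-by-document count matrix.  Returns the
    rank-k factors and a helper that folds a query term-vector into the
    k-dimensional document space."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    uk, sk, vtk = u[:, :k], s[:k], vt[:k, :]
    def fold_in(query_vec):
        # pseudo-document: q_hat = q^T * U_k * S_k^{-1}
        return query_vec @ uk @ np.diag(1.0 / sk)
    return uk, sk, vtk, fold_in

# toy 4-term x 3-document count matrix
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 2., 1.],
              [0., 1., 2.]])
uk, sk, docs_k, fold_in = lsi(A, k=2)
q = fold_in(np.array([1., 1., 0., 0.]))          # query containing terms 0 and 1
sims = (docs_k.T @ q) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q))
print(sims)   # cosine similarity of the query pseudo-document to each document
```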
thus, the lsa confidence for term i is calculated by \text{lsa confidence}_i = 1 + \frac{\sum_{j=1}^{ndocs} P(j|i) \log P(j|i)}{\log(ndocs)}, where ndocs is the number of documents in the corpus, and P(j|i) is the likelihood of document j given that term i occurs in it: P(j|i) = \frac{\text{count of term } i \text{ in document } j}{\text{count of term } i \text{ in whole corpus}} influence:3 type:3 pair index:738 citer id:659 citer title:towards better integration of semantic predictors in statistical language modeling citer abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model citee id:447 citee title:document space models using latent semantic analysis citee abstract:in this paper, an approach for constructing mixture language models (lms) based on some notion of semantics is discussed. to this end, a technique known as latent semantic analysis (lsa) is used. the approach encapsulates corpus-derived semantic information and is able to model the varying style of the text. using such information, the corpus texts are clustered in an unsupervised manner and mixture lms are automatically created. this work builds on previous work in the field of information surrounding text:[8]<2>, [6]<2>). in previous work, we [1]<2> and others [2]<2>, [***]<2> have suggested the use of latent semantic analysis (lsa) [3]<1> as a model of semantic knowledge to be applied to asr. lsa is a model of word semantic similarity based on word co-occurrence tendencies, and has been successful in ir and nlp applications, such as spelling correction [7]<3> influence:1 type:2 pair index:739 citer id:659 citer title:towards better integration of semantic predictors in statistical language modeling citer abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model citee id:642 citee title:improving and predicting performance of statistical language models in sparse domains citee abstract:standard statistical language models, or n-gram models, which represent the probability of word sequences, suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available.
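a direct transcription of the entropy-based confidence formula above; the per-document counts in the example are illustrative. the value is 1 for a term concentrated in a single document and falls towards 0 for a term spread evenly over the corpus, which is the behaviour the citer relies on to down-weight frequent words.

```python
import math

def lsa_confidence(term_counts):
    """term_counts[j] = count of the term in document j.  Returns
    1 + (sum_j P(j|term) * log P(j|term)) / log(ndocs): 1 for a term
    concentrated in one document, near 0 for a term spread evenly."""
    ndocs = len(term_counts)
    total = sum(term_counts)
    if total == 0 or ndocs < 2:
        return 0.0
    entropy = sum((c / total) * math.log(c / total) for c in term_counts if c > 0)
    return 1.0 + entropy / math.log(ndocs)

print(lsa_confidence([5, 0, 0, 0]))   # concentrated term -> 1.0
print(lsa_confidence([1, 1, 1, 1]))   # evenly spread term -> 0.0
```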
this thesis focuses on improving the estimation of domain-dependent n-gram models by using out-of-domain text data. previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains surrounding text:g. [8]<2>, [***]<2>). in previous work, we [1]<2> and others [2]<2>, [5]<2> have suggested the use of latent semantic analysis (lsa) [3]<1> as a model of semantic knowledge to be applied to asr influence:2 type:2 pair index:740 citer id:659 citer title:towards better integration of semantic predictors in statistical language modeling citer abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model citee id:372 citee title:contextual spelling correction using latent semantic analysis citee abstract:contextual spelling errors are defined as the use of an incorrect, though valid, word in a particular sentence or context. traditional spelling checkers flag misspelled words, but they do not typically attempt to identify words that are used incorrectly in a sentence. we explore the use of latent semantic analysis for correcting these incorrectly used words and the results are compared to earlier work based on a bayesian classifier surrounding text:in previous work, we [1]<2> and others [2]<2>, [5]<2> have suggested the use of latent semantic analysis (lsa) [3]<1> as a model of semantic knowledge to be applied to asr. lsa is a model of word semantic similarity based on word co-occurrence tendencies, and has been successful in ir and nlp applications, such as spelling correction [***]<3>. lsa is good at predicting the presence of words in the domain of the text, but not good at predicting their exact location influence:3 type:2 pair index:741 citer id:659 citer title:towards better integration of semantic predictors in statistical language modeling citer abstract:we introduce a number of techniques designed to help integrate semantic knowledge with n-gram language models for automatic speech recognition. our techniques allow us to integrate latent semantic analysis (lsa), a word-similarity algorithm based on word co-occurrence information, with n-gram models. while lsa is good at predicting content words which are coherent with the rest of a text, it is a bad predictor of frequent words, has a low dynamic range, and is inaccurate when combined linearly with n-grams. 
we show that modifying the dynamic range, applying a per-word confidence metric, and using geometric rather than linear combinations with n-grams produces a more robust language model which has a lower perplexity on a wall street journal testset than a baseline n-gram model citee id:126 citee title:a maximum entropy approach to adaptive statistical language modeling citee abstract:an adaptive statistical language model is described, which successfully integrates long distance linguistic information with other knowledge sources. most existing statistical language models exploit only the immediate history of a text. to extract information from further back in the document's history, we propose and use trigger pairs as the basic information bearing elements. this allows the model to adapt its expectations to the topic of discourse. next, statistical evidence from multiple sources must be combined. traditionally, linear interpolation and its variants have been used, but these are shown here to be seriously deficient. instead, we apply the principle of maximum entropy (me). each information source gives rise to a set of constraints, to be imposed on the combined estimate. the intersection of these constraints is the set of probability functions which are consistent with all the information sources. the function with the highest entropy within that set is the me solution. given consistent statistical evidence, a unique me solution is guaranteed to exist, and an iterative algorithm exists which is guaranteed to converge to it. the me framework is extremely general: any phenomenon that can be described in terms of statistics of the text can be readily incorporated. an adaptive language model based on the me approach was trained on the wall street journal corpus, and showed 32%-39% perplexity reduction over the baseline. when interfaced to sphinx-ii, carnegie mellon's speech recognizer, it reduced its error rate by 10%-14%. this thus illustrates the feasibility of incorporating many diverse knowledge sources in a single, unified statistical framework. surrounding text:g. [***]<2>, [6]<2>). in previous work, we [1]<2> and others [2]<2>, [5]<2> have suggested the use of latent semantic analysis (lsa) [3]<1> as a model of semantic knowledge to be applied to asr influence:2 type:2 pair index:742 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation.
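the geometric (rather than linear) combination of the n-gram and lsa predictors mentioned in the citer abstract above can be sketched as follows; the interpolation weight and the toy next-word distributions are illustrative, and the geometric mixture has to be renormalised over the vocabulary because the product of powers no longer sums to one.

```python
def linear_mix(p_ngram, p_lsa, lam=0.5):
    """Linear interpolation: already a proper distribution, no renormalisation."""
    return {w: lam * p_ngram[w] + (1 - lam) * p_lsa[w] for w in p_ngram}

def geometric_mix(p_ngram, p_lsa, lam=0.5):
    """Geometric interpolation: take p_ngram^lam * p_lsa^(1-lam) and
    renormalise over the vocabulary."""
    raw = {w: (p_ngram[w] ** lam) * (p_lsa[w] ** (1 - lam)) for w in p_ngram}
    z = sum(raw.values())
    return {w: v / z for w, v in raw.items()}

# toy next-word distributions over a three-word vocabulary
p_ngram = {"the": 0.6, "stock": 0.3, "petunia": 0.1}
p_lsa = {"the": 0.2, "stock": 0.7, "petunia": 0.1}
print(linear_mix(p_ngram, p_lsa))
print(geometric_mix(p_ngram, p_lsa))
```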
the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:243 citee title:automated ranking of database query results citee abstract:ranking and returning the most relevant results of a query is a popular paradigm in information retrieval. we discuss challenges and investigate several approaches to enable ranking in databases, including adaptations of known techniques from information retrieval. we present results of preliminary experiments surrounding text:introduction 1. 1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [***]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores influence:2 type:3 pair index:743 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. 
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:457 citee title:eddies: continuously adaptive query processing citee abstract:in large federated and shared-nothing databases, resources can exhibit widely fluctuating characteristics. assumptions made at the time a query is submitted will rarely hold throughout the duration of query processing. as a result, traditional static query optimization and execution techniques are ineffective in these environments. in this paper we introduce a query processing mechanism called an eddy, which continuously reorders operators in a query plan as it runs. we characterize the... surrounding text:their approach is based on a dbms-oriented compile-time view: they consider only binary rank joins and a join tree to combine the index lists for all attributes or keywords of the query, and they generate the execution plan before query execution starts. an alternative, run-time-oriented, approach follows the eddies-style notion of adaptive join orders on a per tuple basis [***]<2> rather than fixing join orders at compile-time. then the query optimization for top-k queries with threshold-driven evaluation becomes a scheduling problem. this is the approach that we pursue in this paper. in contrast to [***, 15]<2> we do not restrict ourselves to trees of binary joins, but consider all index lists relevant to the query together. influence:2 type:2 pair index:744 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation.
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:462 citee title:efficient distributed skylining for web information systems citee abstract:though skyline queries already have claimed their place in retrieval over central databases, their application in web information systems up to now was impossible due to the distributed aspect of retrieval over web sources. but due to the amount, variety and volatile nature of information accessible over the internet extended query capabilities are crucial. we show how to efficiently perform distributed skyline queries and thus essentially extend the expressiveness of querying todays web information systems. together with our innovative retrieval algorithm we also present useful heuristics to further speed up the retrieval in most practical cases paving the road towards meeting even the realtime challenges of on-line information services. we discuss performance evaluations and point to open problems in the concept and application of skylining in modern information systems. for the curse of dimensionality, an intrinsic problem in skyline queries, we propose a novel sampling scheme that allows to get an early impression of the skyline for subsequent query refinement surrounding text:introduction 1. 1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [***, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores influence:2 type:3 pair index:745 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. 
the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:248 citee title:evaluating top-k queries over web-accessible databases citee abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. surrounding text:this leads to preferring sas on index lists with steep gradient [14]<2>. [9]<2> and [***, 22]<2> developed the strategies mpro, upper, and pick for scheduling ras on expensive predicates. they considered restricted attribute sources, such as non-indexed attributes or internet sites that do not support sorted access at all (e. our computational model differs from these settings in that we assume that all attributes are indexes with support for both sa and ra and that all index lists are on the same server and thus have identical access costs. for our setting, mpro [9]<2> is essentially the same as the upper method developed in [***, 22]<2>. upper alternates between ra and sa steps. then sa are scheduled in a round-robin way until such a data item appears again. [***]<2> also developed the pick method that runs in two phases. in the first phase, it makes only sa until all potential result documents have been read. it presents stress tests and large-scale performance experiments that demonstrate the viability and significant benefits of the proposed scheduling strategies.
on three different datasets (trec terabyte, http server logs, and imdb), our methods achieve significant performance gains compared to the best previously known methods, fagins combined algorithm (ca) and variants of the upper and pick [***, 22]<2> algorithms: a factor of up to 3 in terms of abstract execution costs, and a factor of 5 in terms of absolute run-times of our implementation. we also show that our best techniques are within 20 percent of a lower bound for the execution cost of any top-k algorithm from the ta family. , 22 if we want the top-10 of a 3-word query), and, it seems, way too pessimistic. [***]<2> presented a way to compute a lower bound on the cost of individual queries for the special case of queries with only a single indexed attribute we extend their approach to our setting where all lists can be accessed with sorted and random accesses. for any topk query processing method, after it has done its last sa, consider the set x of documents which were seen in at least one of the sorted accesses, and which have a bestscore not only above the current min-k score, but even above the final min-k score (which the method does not know at this time). 5 to assess how close our algorithms get to the optimum. we also ran our experiments for the ra-extensive threshold algorithms ta, upper[***, 22]<2> and pick[***]<2>. in our setting, where both sorted and random access is possible and a random access is much more expensive than a sorted access (the lowest ratio we consider is 100), all these methods performed considerably worse than even the full merge baseline, in terms of both costs and running times, and for all values of k and cr/cs we considered. 5 to assess how close our algorithms get to the optimum. we also ran our experiments for the ra-extensive threshold algorithms ta, upper[***, 22]<2> and pick[***]<2>. in our setting, where both sorted and random access is possible and a random access is much more expensive than a sorted access (the lowest ratio we consider is 100), all these methods performed considerably worse than even the full merge baseline, in terms of both costs and running times, and for all values of k and cr/cs we considered influence:1 type:2 pair index:746 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. 
our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:666 citee title:the effect of adding relevance information in a relevance feedback environment citee abstract:the effects of adding information from relevant documents are examined in the trec routing environment. a modified rocchio relevance feedback approach is used, with a varying number of relevant documents retrieved by an initial smart search, and a varying number of terms from those relevant documents used to expand the initial query. recall-precision evaluation reveals that as the amount of expansion of the query due to adding terms from relevant documents increases, so does the effectiveness. surrounding text:introduction 1.1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [***, 21, 26]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores influence:2 type:3 pair index:747 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses.
the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:504 citee title:efficient top-k query calculation in distributed networks citee abstract:this paper presents a new algorithm to answer top-k queries (e.g., "find the k objects with the highest aggregate values") in a distributed network. existing algorithms such as the threshold algorithm consume an excessive amount of bandwidth when the number of nodes, m, is high. we propose a new algorithm called "three-phase uniform threshold" (tput). tput reduces network bandwidth consumption by pruning away ineligible objects, and terminates in three round-trips regardless of data input. the paper presents two sets of results about tput. first, trace-driven simulations show that, depending on the size of the network, tput reduces network traffic by one to two orders of magnitude compared to existing algorithms. second, tput is proven to be instance-optimal on data series that satisfy a lower bound on the slope of decreases in values. in particular, analysis shows that by using a pruning parameter a < 1, tput achieves a qualitative reduction in network traffic, for example, lowering the optimality ratio from o(m * m) to o(m *,m) for data series following zipf distribution. surrounding text:introduction 1.1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [***]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores influence:2 type:2,3 pair index:748 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates.
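to make the three-phase idea in the tput abstract above concrete, here is a small, centralized python simulation of the three round-trips; the per-node data layout, the tau1/m threshold, and the pruning rule follow a common textbook reading of tput and are assumptions here, not details taken from the cited text.

from heapq import nlargest

# simplified, single-process simulation of tput's three round-trips over m "nodes",
# each holding a dict doc_id -> local score; the aggregate score is the sum over nodes.
def tput_topk(nodes, k):
    m = len(nodes)
    local_topk = [nlargest(k, nd.items(), key=lambda kv: kv[1]) for nd in nodes]   # phase 1
    partial = {}
    for lst in local_topk:
        for d, s in lst:
            partial[d] = partial.get(d, 0.0) + s
    tau1 = sorted(partial.values(), reverse=True)[k - 1]   # k-th best partial sum (assumes >= k objects seen)
    t = tau1 / m                                           # uniform per-node threshold for phase 2
    reported = []
    for i, nd in enumerate(nodes):                         # phase 2: fetch all local scores >= t
        rep = {d: s for d, s in nd.items() if s >= t}
        rep.update(dict(local_topk[i]))                    # keep the phase-1 answers as well
        reported.append(rep)
    partial, seen_at = {}, {}
    for i, rep in enumerate(reported):
        for d, s in rep.items():
            partial[d] = partial.get(d, 0.0) + s
            seen_at.setdefault(d, set()).add(i)
    tau2 = sorted(partial.values(), reverse=True)[k - 1]
    # prune: a node that did not report d can contribute at most t to d's total score
    candidates = [d for d, p in partial.items() if p + t * (m - len(seen_at[d])) >= tau2]
    exact = {d: sum(nd.get(d, 0.0) for nd in nodes) for d in candidates}           # phase 3 lookups
    return nlargest(k, exact.items(), key=lambda kv: kv[1])

nodes = [{"a": 0.9, "b": 0.5, "c": 0.4}, {"a": 0.2, "b": 0.8, "d": 0.7}, {"c": 0.9, "b": 0.3, "d": 0.1}]
print(tput_topk(nodes, k=2))    # b then c are the true top-2 for this toy data

the point of the exercise is visible in the pruning step: only objects whose upper bound still reaches the k-th best partial sum need the exact (random-access-like) lookups of phase 3.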
the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:249 citee title:on saying "enough already!" in sql citee abstract:in this paper, we study a simple sql extension that enables query writers to explicitly limit the cardinality of a query result. we examine its impact on the query optimization and run-time execution components of a relational dbms, presenting two approaches - a conservative approach and an aggressive approach - to exploiting cardinality limits in relational query plans. results obtained from an empirical study conducted using db2 demonstrate the benefits of the sql extension and illustrate the tradeoffs between our two approaches to implementing it. surrounding text:additionally, for promising candidates, unknown scores for some attributes can be looked up with random accesses, making the score bounds more precise. when scanning multiple index lists (over attributes from one or more relations or document collections), top-k query processing faces an optimization problem: combining each pair of indexes is essentially an equi-join (via equality of the tuple or document ids in matching index entries), and we thus need to solve a join ordering problem [***, 15, 20]<2>. as top-k queries are eventually interested only in the highest-score results, the problem is not just standard join ordering but has additional complexity influence:3 type:3 pair index:749 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates.
the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:667 citee title:minimal probing: supporting expensive predicates for top-k queries citee abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require a per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing. surrounding text:we realize that this may not always be possible, for example, when a sql query with a stop-after clause uses non-indexed attributes in the order-by clause. the latter situation may arise, for example, when expensive user-defined predicates are involved in the query [***, 10, 20]<2> (e.g. this leads to preferring sas on index lists with steep gradient [14]<2>. [***]<2> and [5, 22]<2> developed the strategies mpro, upper, and pick for scheduling ras on expensive predicates. they considered restricted attribute sources, such as non-indexed attributes or internet sites that do not support sorted access at all (e. our computational model differs from these settings in that we assume that all attributes are indexed with support for both sa and ra and that all index lists are on the same server and thus have identical access costs. for our setting, mpro [***]<2> is essentially the same as the upper method developed in [5, 22]<2>. upper alternates between ra and sa steps influence:2 type:2 pair index:750 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates.
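the principle of necessary probes in the mpro abstract above, and the upper strategy mentioned in the surrounding text, both come down to spending an expensive random access only on a candidate that can still enter the top-k. the python fragment below is a hedged illustration of that selection rule; the candidate representation and the min-k threshold name are assumptions, not code from the cited papers.

# pick the next candidate worth a random access: a probe is only "necessary" for a
# candidate whose upper score bound (bestscore) still exceeds the current min-k threshold.
def next_probe_candidate(candidates, mink):
    # candidates: dict doc_id -> (worstscore, bestscore) with bestscore >= worstscore
    probe_worthy = {d: b for d, (w, b) in candidates.items() if b > mink}
    if not probe_worthy:
        return None                      # no probe is necessary; sorted scans alone can finish the query
    # upper-style choice: probe the candidate with the highest upper bound first
    return max(probe_worthy, key=probe_worthy.get)

# toy example: with min-k = 1.0 only d2 and d3 can still make it into the top-k
cands = {"d1": (0.4, 0.9), "d2": (0.6, 1.3), "d3": (0.7, 1.1)}
print(next_probe_candidate(cands, mink=1.0))   # -> "d2"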
one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:668 citee title:optimizing top-k selection queries over multimedia repositories citee abstract:repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. a query on these attributes will typically request not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, which indicates how well the object matches the selection condition (ranking). furthermore, unlike in the relational model, users may just want the k top-ranked objects for their selection queries for a relatively small k. in addition to the differences in the query model, another peculiarity of multimedia repositories is that they may allow access to the attributes of each object only through indexes. in this paper, we investigate how to optimize the processing of top-k selection queries over multimedia repositories. the access characteristics of the repositories and the above query model lead to novel issues in query optimization. in particular, the choice of the indexes used to search the repository strongly influences the cost of processing the filtering condition. we define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. although the general problem of picking an optimal plan in the search-minimal execution space is np-hard, we present an efficient algorithm that solves the problem optimally with respect to our cost model and execution space when the predicates in the query are independent. we also show that the problem of optimizing top-k selection queries can be viewed, in many cases, as that of evaluating more traditional selection conditions. thus, both problems can be viewed together as an extended filtering problem to which techniques of query processing and optimization may be adapted. surrounding text:introduction
1.1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [***, 11, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. we realize that this may not always be possible, for example, when a sql query with a stop-after clause uses non-indexed attributes in the order-by clause. the latter situation may arise, for example, when expensive user-defined predicates are involved in the query [9, ***, 20]<2> (e.g. influence:1 type:2,3 pair index:751 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:331 citee title:combining fuzzy information: an overview citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes. surrounding text:introduction
1.1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, ***, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. our goal is to minimize the sum of the access costs, assuming a fixed cost cs for each sorted access and a fixed cost cr for each random access. the same assumptions were made in [***]<2>. we also study how to leverage statistics on score distributions for the scheduling of index-scan steps. to remedy this, [12, 14]<2> proposed the nra (no ra) variant of ta, but occasional, carefully scheduled ras can still be useful when they can contribute to major pruning of candidates. therefore, [***]<2> also introduced a combined algorithm (ca) framework but did not discuss any data- or scoring-specific scheduling strategies. [14]<2> developed heuristic strategies for scheduling sas over multiple lists influence:3 type:3 pair index:752 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes.
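the access-cost model quoted in the surrounding text above (a fixed cost per sorted access and a fixed, much larger cost per random access) can be written out explicitly; the notation below just renames the excerpt's cs and cr as c_S and c_R, and the ratio of 100 is the smallest c_R/c_S value mentioned in these excerpts:

\[
C \;=\; n_{SA}\cdot c_S \;+\; n_{RA}\cdot c_R,
\qquad\text{e.g., } c_R = 100\,c_S,\ n_{SA}=10{,}000,\ n_{RA}=50
\;\Rightarrow\; C = (10{,}000 + 50\cdot 100)\,c_S = 15{,}000\,c_S .
\]

roughly speaking, the scheduling problem discussed in these excerpts is the trade-off this sum exposes: a random access is only worthwhile if it saves on the order of c_R/c_S sorted accesses elsewhere.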
for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm (fagin's algorithm, or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm (the threshold algorithm, or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. surrounding text:introduction 1.1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, 11, ***, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. the method that has been most strongly advocated in recent years is the family of threshold algorithms (ta) [***, 14, 25]<2> that perform index scans over precomputed index lists, one for each attribute or keyword in the query, which are sorted in descending order of per-attribute or per-keyword scores.
the key point of ta is that it aggregates scores on the fly, thus computes a lower bound for the total score of the current rank-k result record (document) and an upper bound for the total scores of all other candidate records (documents), and is thus often able to terminate the index scans long before it reaches the bottom of the index lists, namely, when the lower bound for the rank-k result, the threshold, is at least as high as the upper bound for all other candidates. early variants also made intensive use of random access (ra) to index entries to resolve missing score values of result candidates, but for very large index lists with millions of entries that span multiple disk tracks, the resulting random access cost cr is 50-50,000 times higher than the cost cs of a sorted access (sa). to remedy this, [***, 14]<2> proposed the nra (no ra) variant of ta, but occasional, carefully scheduled ras can still be useful when they can contribute to major pruning of candidates. therefore, [11]<2> also introduced a combined algorithm (ca) framework but did not discuss any data- or scoring-specific scheduling strategies. 2.5 computing lower bounds [***]<2> proved that the ca algorithm has costs that are always within a factor of 4m + k of the optimum, where m is the number of lists and k is the number of top items we want to see. even for small values of m and k, this factor is fairly large (e influence:2 type:2,3 pair index:753 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations.
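the bookkeeping behind the ta/nra termination test described above (a worstscore lower bound and a bestscore upper bound per candidate, compared against the current min-k score) can be made concrete with a small python sketch; the in-memory list layout, the plain round-robin schedule, and the helper names are illustrative assumptions, not the paper's implementation.

from heapq import nlargest

# illustrative nra-style top-k over m score-sorted lists using only sorted accesses (round-robin);
# worstscore(d) = sum of d's known scores, bestscore(d) = worstscore(d) plus the last score seen
# in every list where d has not appeared yet (an upper bound because lists are sorted descending).
def nra_topk(lists, k):
    m = len(lists)
    pos = [0] * m                                        # scan position per list
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # last score seen per list
    seen = {}                                            # doc_id -> per-list scores (None = unseen)
    while True:
        for i in range(m):                               # one round-robin batch of sorted accesses
            if pos[i] < len(lists[i]):
                doc, s = lists[i][pos[i]]
                pos[i] += 1
                high[i] = s
                seen.setdefault(doc, [None] * m)[i] = s
        def worst(sc): return sum(x for x in sc if x is not None)
        def best(sc):  return worst(sc) + sum(high[i] for i, x in enumerate(sc) if x is None)
        exhausted = all(pos[i] >= len(lists[i]) for i in range(m))
        topk = nlargest(k, seen.items(), key=lambda kv: worst(kv[1]))
        if len(topk) < k:
            if exhausted:
                return [(d, worst(sc)) for d, sc in topk]
            continue
        mink = worst(topk[-1][1])                        # score of the current rank-k candidate
        topk_ids = {d for d, _ in topk}
        # stop when neither another seen candidate nor any unseen document can still beat min-k
        others_done = all(best(sc) <= mink for d, sc in seen.items() if d not in topk_ids)
        unseen_done = sum(high) <= mink
        if (others_done and unseen_done) or exhausted:
            return [(d, worst(sc)) for d, sc in topk]

lists = [[("d1", 0.9), ("d2", 0.8), ("d3", 0.1)], [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)]]
print(nra_topk(lists, k=1))    # [('d2', 1.5)] for this toy input

the scheduling questions in the surrounding excerpts are exactly the two choices this sketch makes naively: which list to scan next instead of plain round-robin, and when a random access on a specific candidate would tighten the bounds faster than further scanning.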
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:323 citee title:information retrieval citee abstract:information retrieval is a wide, often loosely-defined term but in these pages i shall be concerned only with automatic information retrieval systems. automatic as opposed to manual and information as opposed to data or fact. unfortunately the word information can be very misleading. in the context of information retrieval (ir), information, in the technical meaning given in shannon's theory of communication, is not readily measured (shannon and weaver). in fact, in many cases one can surrounding text:g. , a desired temperature or the set value of a control parameter), and for text or semistructured documents the score could be an ir relevance measure such as tfidf or the probabilistic bm25 score derived from term frequencies (tf) and inverse document frequencies (idf) [***]<3>. we denote the score of data item dj for the ith dimension by sij. one particularity of the trec queries is that they come with larger description and narrative fields that allow the extraction of larger keyword queries. we indexed the collection with bm25 and a standard tfidf scoring model [***]<3>. we imported movie information from the internet movie database imdb for more than 375,000 movies and more than 1,200,000 persons (actors, directors, etc influence:3 type:3 pair index:754 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations.
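the per-keyword scores mentioned in the surrounding text above (tf·idf and the probabilistic bm25 score) are what fill the index lists that the sorted scans read. one common form of bm25, with the conventional parameter defaults k_1 ≈ 1.2 and b ≈ 0.75 (an assumption here, not necessarily the setting used in the cited experiments), is:

\[
\mathrm{score}(d,q) \;=\; \sum_{t\in q}\;
\log\frac{N - \mathit{df}_t + 0.5}{\mathit{df}_t + 0.5}\;\cdot\;
\frac{\mathit{tf}_{t,d}\,(k_1+1)}{\mathit{tf}_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)} ,
\]

where tf_{t,d} is the term frequency of term t in document d, df_t its document frequency, N the collection size, |d| the document length, and avgdl the average document length.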
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:669 citee title:towards efficient multi-feature queries in heterogeneous environments citee abstract:applications like multimedia databases or enterprise-wide information management systems have to meet the challenge of efficiently retrieving best matching objects from vast collections of data. we present a new algorithm stream-combine for processing multi-feature queries on heterogeneous data sources. stream-combine is self-adapting to different data distributions and to the specific kind of the combining function. furthermore we present a new retrieval strategy that will essentially speed up surrounding text:such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. the method that has been most strongly advocated in recent years is the family of threshold algorithms (ta) [12, ***, 25]<2> that perform index scans over precomputed index lists, one for each attribute or keyword in the query, which are sorted in descending order of per-attribute or per-keyword scores. the key point of ta is that it aggregates scores on the fly, thus computes a lower bound for the total score of the current rank-k result record (document) and an upper bound for the total scores of all other candidate records (documents), and is thus often able to terminate the index scans long before it reaches the bottom of the index lists, namely, when the lower bound for the rank-k result, the threshold, is at least as high as the upper bound for all other candidates. early variants also made intensive use of random access (ra) to index entries to resolve missing score values of result candidates, but for very large index lists with millions of entries that span multiple disk tracks, the resulting random access cost cr is 50-50,000 times higher than the cost cs of a sorted access (sa). to remedy this, [12, ***]<2> proposed the nra (no ra) variant of ta, but occasional, carefully scheduled ras can still be useful when they can contribute to major pruning of candidates. therefore, [11]<2> also introduced a combined algorithm (ca) framework but did not discuss any data- or scoring-specific scheduling strategies. therefore, [11]<2> also introduced a combined algorithm (ca) framework but did not discuss any data- or scoring-specific scheduling strategies. [***]<2> developed heuristic strategies for scheduling sas over multiple lists. these are greedy heuristics based on limited or crude estimates of scores, namely, the score gradients up to the current cursor positions in the index scans and the average score in an index list. this leads to preferring sas on index lists with steep gradient [***]<2>.
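the gradient heuristic described in the surrounding text above prefers sorted accesses on the list whose scores are currently falling fastest, because that scan lowers the upper bound for unseen documents the most per access. the following small python illustration of such a rule is only a sketch; the window size and the treatment of barely-scanned lists are assumptions.

# greedy SA scheduling: estimate each list's recent score gradient and scan the
# list whose scores drop fastest, since that shrinks the bestscore of unseen documents most.
def pick_list_to_scan(recent_scores, window=10):
    # recent_scores[i]: scores seen so far in list i, in scan order (descending)
    best_i, best_drop = None, -1.0
    for i, scores in enumerate(recent_scores):
        if len(scores) < 2:
            return i                               # scan lists we know nothing about first
        w = scores[-window:]
        drop = (w[0] - w[-1]) / (len(w) - 1)       # average score decrease per sorted access
        if drop > best_drop:
            best_i, best_drop = i, drop
    return best_i

print(pick_list_to_scan([[0.9, 0.88, 0.87], [0.9, 0.6, 0.35]]))   # -> 1 (steeper gradient)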
[9]<2> and [5, 22]<2> developed the strategies mpro, upper, and pick for scheduling ras on expensive predicates influence:2 type:2 pair index:755 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:670 citee title:supporting top-k join queries in relational databases citee abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance. surrounding text:introduction
1.1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [***, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. additionally, for promising candidates, unknown scores for some attributes can be looked up with random accesses, making the score bounds more precise. when scanning multiple index lists (over attributes from one or more relations or document collections), top-k query processing faces an optimization problem: combining each pair of indexes is essentially an equi-join (via equality of the tuple or document ids in matching index entries), and we thus need to solve a join ordering problem [8, ***, 20]<2>. as top-k queries are eventually interested only in the highest-score results, the problem is not just standard join ordering but has additional complexity. as top-k queries are eventually interested only in the highest-score results, the problem is not just standard join ordering but has additional complexity. [***]<2> have called this issue the problem of finding optimal rank-join execution plans. their approach is based on a dbms-oriented compile-time view: they consider only binary rank joins and a join tree to combine the index lists for all attributes or keywords of the query, and they generate the execution plan before query execution starts. this is the approach that we pursue in this paper. in contrast to [2, ***]<2> we do not restrict ourselves to trees of binary joins, but consider all index lists relevant to the query together. influence:2 type:2,3 pair index:756 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation.
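the equi-join view in the excerpt above (index lists joined on document ids) is easiest to see in the "full merge" baseline that other excerpts in this section compare against: aggregate every entry of every list and rank only at the end. a minimal python sketch follows, with hash aggregation by doc id and summation as the monotone aggregation function (both assumptions).

from collections import defaultdict
from heapq import nlargest

# full-merge baseline: hash-aggregate every (doc_id, score) entry of every list,
# i.e., an equi-join of the lists on doc_id, then rank by the aggregated score.
def full_merge_topk(lists, k):
    totals = defaultdict(float)
    for entries in lists:                  # entries: iterable of (doc_id, score)
        for doc, score in entries:
            totals[doc] += score           # summation as the monotone aggregation function
    return nlargest(k, totals.items(), key=lambda kv: kv[1])

lists = [[("d1", 0.9), ("d2", 0.25)], [("d2", 0.5), ("d3", 0.4)]]
print(full_merge_topk(lists, k=2))         # [('d1', 0.9), ('d2', 0.75)]

this baseline reads every list completely (only sorted accesses, no early termination), which is why the threshold-style methods in the other excerpts try to stop the scans early.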
the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:671 citee title:rank-aware query optimization citee abstract:ranking is an important property that needs to be fully supported by current relational query engines. recently, several rank-join query operators have been proposed based on rank aggregation algorithms. rank-join operators progressively rank the join results while performing the join operation. the new operators have a direct impact on traditional query processing and optimization. we introduce a rank-aware query optimization framework that fully integrates rank-join operators into relational query engines. the framework is based on extending the system r dynamic programming algorithm in both enumeration and pruning. we define ranking as an interesting property that triggers the generation of rank-aware query plans. unlike traditional join operators, optimizing for rank-join operators depends on estimating the input cardinality of these operators. we introduce a probabilistic model for estimating the input cardinality, and hence the cost of a rank-join operator. to our knowledge, this paper is the first effort in estimating the needed input size for optimal rank aggregation algorithms. costing ranking plans, although challenging, is key to the full integration of rank-join operators in real-world query processing engines. we experimentally evaluate our framework by modifying the query optimizer of an open-source database management system. the experiments show the validity of our framework and the accuracy of the proposed estimation model. surrounding text:our topx work on xml ir [28]<3> included specific scheduling aspects for resolving structural path conditions, but did not consider the more general problem of integrated scheduling for sas and ras. the ranksql work [***, 20]<2> considers the order of binary rank joins at query-planning time. thus, at query run-time there is no flexible scheduling anymore. thus, at query run-time there is no flexible scheduling anymore. for the planning-time optimization, ranksql uses simple statistical models, assuming that scores within a list follow a normal distribution [***]<2>. this assumption is made for tractability, to simplify convolutions influence:3 type:2 pair index:757 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates.
this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:672 citee title:the history of histograms citee abstract:the history of histograms is long and rich, full of detailed information in every step. it includes the course of histograms in different scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. in this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their "future history" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. in a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small! surrounding text:as we don't know the actual distribution of the si unless we have read the whole list, we have to model or approximate the distribution. we use histograms [***]<1> as an efficient and commonly used means to compactly capture arbitrary score distributions. in our application, we precompute a histogram for the score distribution of each index list, discretizing the score domain for each index list into h buckets with lower bucket bounds s1, influence:3 type:3 pair index:758 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates.
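the histogram idea in the surrounding text above (one histogram per index list, capturing its score distribution in h buckets) can be sketched as follows in python; the equi-width bucketing over [0, 1], the bucket count, and the uniform-within-bucket estimate are assumptions, not the paper's exact statistics.

# per-list score histogram: h equi-width buckets over [0, 1]; used to estimate
# how likely an entry of this list scores at least some value x.
class ScoreHistogram:
    def __init__(self, scores, h=32):
        self.h = h
        self.counts = [0] * h
        for s in scores:                        # one pass over the index list's scores
            b = min(int(s * h), h - 1)          # bucket index for a score s in [0, 1]
            self.counts[b] += 1
        self.total = len(scores)

    def prob_score_at_least(self, x):
        # fraction of entries with score >= x, assuming scores are uniform within a bucket
        b = min(int(x * self.h), self.h - 1)
        upper_edge = (b + 1) / self.h
        frac_in_bucket = (upper_edge - x) * self.h      # portion of bucket b above x
        above = sum(self.counts[b + 1:]) + frac_in_bucket * self.counts[b]
        return above / self.total if self.total else 0.0

hist = ScoreHistogram([0.9, 0.85, 0.7, 0.4, 0.2, 0.1])
print(hist.prob_score_at_least(0.5))    # 0.5: three of the six toy scores are >= 0.5

estimates of this kind are what let a scheduler guess, before scanning further, how much score mass is still hidden below the current cursor position of each list.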
the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:673 citee title:on the integration of structure indexes and inverted lists citee abstract:several methods have been proposed to evaluate queries over a native xml dbms, where the queries specify both path and keyword constraints. these broadly consist of graph traversal approaches, optimized with auxiliary structures known as structure indexes; and approaches based on information-retrieval style inverted lists. we propose a strategy that combines the two forms of auxiliary indexes, and a query evaluation algorithm for branching path expressions based on this strategy. our technique is general and applicable for a wide range of choices of structure indexes and inverted list join algorithms. our experiments over the niagara xml dbms show the benefit of integrating the two forms of indexes. we also consider algorithmic issues in evaluating path expression queries when the notion of relevance ranking is incorporated. by integrating the above techniques with the threshold algorithm proposed by fagin et al., we obtain instance-optimal algorithms to push down top-k computation. surrounding text:introduction 1.1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [15, ***]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. future work could investigate the combination of our approach with approximative pruning strategies. also the extension of our index access scheduling framework for processing xml data along the lines of [***, 23]<2> would be very interesting.
one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:476 citee title:space-limited ranked query evaluation using adaptive pruning citee abstract:evaluation of ranked queries on large text collections can be costly in terms of processing time and memory space. dynamic pruning techniques allow both costs to be reduced, at the potential risk of decreased retrieval effectiveness. in this paper we describe an improved query pruning mechanism that offers a more resilient tradeoff between query evaluation costs and retrieval effectiveness than do previous pruning approaches. surrounding text:conclusions this paper presents a comprehensive algorithmic framework and extensive experimentation for various data collections and system setups to address the problem of index access scheduling in top-k query processing. unlike more aggressive pruning strategies proposed in the literature [***, 24, 29]<2> that provide approximate top-k results, the methods we presented here are non-approximative and achieve major runtime gains of factors up to 5 over existing state-of-the-art approaches with no loss in result precision. moreover, we show that already the simpler methods of our framework, coined the last strategies, provide the largest contribution to this improvement, and the probabilistic extensions get very close to a lower bound for the optimum cost influence:2 type:2 pair index:760 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty.
this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:674 citee title:ranksql: query algebra and optimization for relational top-k queries citee abstract:this paper introduces ranksql, a system that provides a systematic and principled framework to support efficient evaluations of ranking (top-k) queries in relational database systems (rdbms), by extending relational algebra and query optimization. previously, top-k query processing is studied in the middleware scenario or in rdbms in a piecemeal fashion, i.e., focusing on specific operators or sitting outside the core of query engines. in contrast, we aim to support ranking as a first-class database construct. as a key insight, the new ranking relationship can be viewed as another logical property of data, parallel to the membership property of the relational data model. while membership is essentially supported in rdbms, the same support for ranking is clearly lacking. we address the fundamental integration of ranking in rdbms in a way similar to how membership, i.e., boolean filtering, is supported. we extend relational algebra by proposing a rank-relational model to capture the ranking property, and introducing new and extended operators to support ranking as a first-class construct. enabled by the extended algebra, we present a pipelined and incremental execution model of ranking query plans (that cannot be expressed traditionally) based on a fundamental ranking principle. to optimize top-k queries, we propose a dimensional enumeration algorithm to explore the extended plan space by enumerating plans along two dual dimensions: ranking and membership. we also propose a sampling-based method to estimate the cardinality of rank-aware operators, for costing plans. our experiments show the validity of our framework and the accuracy of the proposed estimation model. surrounding text:additionally, for promising candidates, unknown scores for some attributes can be looked up with random accesses, making the score bounds more precise. when scanning multiple index lists (over attributes from one or more relations or document collections), top-k query processing faces an optimization problem: combining each pair of indexes is essentially an equi-join (via equality of the tuple or document ids in matching index entries), and we thus need to solve a join ordering problem [8, 15, ***]<2>. as top-k queries are eventually interested only in the highest-score results, the problem is not just standard join ordering but has additional complexity.
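the ranksql abstract above treats ranking as a first-class operator, and the surrounding text frames multi-list top-k processing as a join ordering problem over rank-producing inputs. as background, the following is a simplified python sketch of the classical rank-join idea (two score-sorted inputs joined on a key, summed scores, and the usual corner-bound threshold for early termination); it is generic background, not the ranksql operator itself.

import heapq

def rank_join(left, right, k):
    """left, right: lists of (key, score) sorted by descending score.
    returns the k join results with the highest summed score.
    simplified rank-join: pull from both inputs alternately, join on key,
    and stop once the k-th best result seen so far cannot be beaten by
    any unseen combination (the usual rank-join threshold)."""
    seen_l, seen_r = {}, {}
    results = []           # heap of (total_score, key)
    top_l = top_r = None   # first (highest) score read from each input
    last_l = last_r = 0.0  # most recent score read from each input
    i = j = 0
    while i < len(left) or j < len(right):
        for side in ("L", "R"):                    # round-robin pulling
            if side == "L" and i < len(left):
                key, s = left[i]; i += 1
                top_l = s if top_l is None else top_l
                last_l = s
                seen_l[key] = s
                if key in seen_r:
                    heapq.heappush(results, (s + seen_r[key], key))
            elif side == "R" and j < len(right):
                key, s = right[j]; j += 1
                top_r = s if top_r is None else top_r
                last_r = s
                seen_r[key] = s
                if key in seen_l:
                    heapq.heappush(results, (seen_l[key] + s, key))
        # best total score any still-unseen join result could reach
        threshold = max((top_l or 0.0) + last_r, last_l + (top_r or 0.0))
        best_k = heapq.nlargest(k, results)
        if len(best_k) == k and best_k[-1][0] >= threshold:
            return best_k
    return heapq.nlargest(k, results)

left = [("a", 0.9), ("b", 0.8), ("c", 0.3)]
right = [("b", 0.95), ("a", 0.5), ("c", 0.4)]
print(rank_join(left, right, k=2))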
we realize that this may not always be possible, for example, when a sql query with a stop-after clause uses non-indexed attributes in the order-by clause. the latter situation may arise, for example, when expensive user-defined predicates are involved in the query [9, 10, ***]<2> (e. g. our topx work on xml ir [28]<3> included specific scheduling aspects for resolving structural path conditions, but did not consider the more general problem of integrated scheduling for sas and ras. the ranksql work [16, ***]<2> considers the order of binary rank joins at query-planning time. thus, at query run-time there is no flexible scheduling anymore influence:1 type:2 pair index:761 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:675 citee title:optimized query execution in large search engines with global page ordering citee abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. 
in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. surrounding text:introduction 1. 1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, ***, 26]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores influence:2 type:2,3 pair index:762 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:248 citee title:evaluating top-k queries over web-accessible databases citee abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications.
for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. surrounding text:introduction 1. 1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, 26]<2>, multimedia similarity search [10, 11, 12, ***]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, ***]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. this leads to preferring sas on index lists with steep gradient [14]<2>. [9]<2> and [5, ***]<2> developed the strategies mpro, upper, and pick for scheduling ras on expensive predicates. they considered restricted attribute sources, such as non-indexed attributes or internet sites that do not support sorted access at all (e. g. , a streetfinder site that computes driving distances and times), and showed how to integrate these sources into a threshold algorithm. [***]<2> also considered sources with widely different ra costs or widely different sa costs (e. g.
our computational model differs from these settings in that we assume that all attributes are indexes with support for both sa and ra and that all index lists are on the same server and thus have identical access costs. for our setting, mpro [9]<2> is essentially the same as the upper method developed in [5, ***]<2>. upper alternates between ra and sa steps. it presents stress tests and large-scale performance experiments that demonstrate the viability and significant benefits of the proposed scheduling strategies. on three different datasets (trec terabyte, http server logs, and imdb), our methods achieve significant performance gains compared to the best previously known methods, fagins combined algorithm (ca) and variants of the upper and pick [5, ***]<2> algorithms: a factor of up to 3 in terms of abstract execution costs, and a factor of 5 in terms of absolute run-times of our implementation. we also show that our best techniques are within 20 percent of a lower bound for the execution cost of any top-k algorithm from the ta family. 5 to assess how close our algorithms get to the optimum. we also ran our experiments for the ra-extensive threshold algorithms ta, upper[5, ***]<2> and pick[5]<2>. in our setting, where both sorted and random access is possible and a random access is much more expensive than a sorted access (the lowest ratio we consider is 100), all these methods performed considerably worse than even the full merge baseline, in terms of both costs and running times, and for all values of k and cr/cs we considered influence:1 type:2 pair index:763 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. 
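the upper strategy summarized above repeatedly considers the candidate with the highest score upper bound and decides whether to spend a random access on it. the sketch below renders one such scheduling step under simplifying assumptions (summed scores, uniform access costs, arbitrary choice of which missing list to probe); it illustrates the control flow only and is not the exact algorithm of [5] or [9].

def upper_bound(cand, high):
    """best possible total score: known scores plus the current
    high score of every list the candidate has not been seen in."""
    return sum(cand["known"].values()) + sum(
        high[l] for l in high if l not in cand["known"])

def upper_step(candidates, high, topk_scores, k, probe):
    """one scheduling decision in the spirit of upper.
    candidates: dict id -> {"known": {list: score}}
    high: dict list -> last score seen under sorted access (bound for unseen entries)
    topk_scores: current k best total scores
    probe(id, list) performs a random access and returns the score."""
    # candidate with the best chance to still enter the top-k
    cid, cand = max(candidates.items(),
                    key=lambda kv: upper_bound(kv[1], high))
    threshold = min(topk_scores) if len(topk_scores) >= k else 0.0
    if upper_bound(cand, high) <= threshold:
        return None          # nobody outside the top-k can still qualify
    missing = [l for l in high if l not in cand["known"]]
    if not missing:
        return None          # candidate fully evaluated already
    lst = missing[0]         # a real implementation picks this by cost/benefit
    cand["known"][lst] = probe(cid, lst)   # the random access
    return cid, lst

high = {"title": 0.6, "body": 0.5}
cands = {"d7": {"known": {"title": 0.9}}}
print(upper_step(cands, high, topk_scores=[1.2, 1.1], k=2,
                 probe=lambda cid, l: 0.4))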
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:200 citee title:adaptive processing of top-k queries in xml citee abstract: the ability to compute top - k matches to xml queries is gaining importance due to the increasing number of large surrounding text:future work could investigate the combination of our approach with approximative pruning strategies. also the extension of our index access scheduling framework for processing xml data along the lines of [18, ***]<2> would be very interesting. 8 influence:3 type:3 pair index:764 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:43 citee title:self-indexing inverted files for fast text retrieval citee abstract:query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. here we show that query response time for conjunctive boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. this method has been applied in a retrieval system for a collection of nearly two million short documents. our experimental results show that the selfindexing strategy adds less than 20% to the size of the inverted file, but, for boolean queries of 5{10 terms, can reduce processing time to under one fifth of the previous cost. 
similarly, ranked queries of 40{50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval e ectiveness surrounding text:conclusions this paper presents a comprehensive algorithmic framework and extensive experimentation for various data collections and system setups to address the problem of index access scheduling in top-k query processing. unlike more aggressive pruning strategies proposed in the literature [19, ***, 29]<2> that provide approximate top-k results, the methods we presented here are non-approximative and achieve major runtime gains of factors up to 5 over existing state-of-theart approaches with no loss in result precision. moreover, we show that already the simpler methods of our framework, coined the last strategies, provide the largest contribution to this improvement, and the probabilistic extensions get very close to a lower bound for the optimum cost influence:1 type:2 pair index:765 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. 
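the self-indexing abstract above adds a small internal index to each inverted list so that long stretches of postings can be skipped during conjunctive processing. a schematic python version of intersection with skip pointers over uncompressed lists is shown below; the cited method works on compressed lists and chooses the skip parameters more carefully, so treat this only as a sketch of the idea.

import math

def build_skips(postings):
    """attach evenly spaced skip pointers: list of (position, docid)."""
    step = max(1, int(math.sqrt(len(postings))))
    return [(i, postings[i]) for i in range(0, len(postings), step)]

def intersect_with_skips(short, long_list):
    """intersect a short postings list with a long one, using the
    long list's internal skip index to avoid scanning every entry."""
    skips = build_skips(long_list)
    result, pos = [], 0
    for doc in short:
        # jump via the skip index to the last skip entry <= doc
        for sp_pos, sp_doc in skips:
            if sp_doc <= doc and sp_pos >= pos:
                pos = sp_pos
        # then scan sequentially only within the skipped-to block
        while pos < len(long_list) and long_list[pos] < doc:
            pos += 1
        if pos < len(long_list) and long_list[pos] == doc:
            result.append(doc)
    return result

print(intersect_with_skips([8, 40, 97], list(range(0, 100, 2))))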
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:527 citee title:query processing issues in image (multimedia) databases citee abstract:multimedia databases have attracted academic and industrial interest, and systems such as qbic (content-based image retrieval system from ibm) have been released. such systems are essential to effectively and efficiently use the existing large collections of image data in the modern computing environment. the aim of such systems is to enable retrieval of images based on their contents. this problem has brought together the (decades old) database and image processing communities. as part of our surrounding text:such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores. the method that has been most strongly advocated in recent years is the family of threshold algorithms (ta) [12, 14, ***]<2> that perform index scans over precomputed index lists, one for each attribute or keyword in the query, which are sorted in descending order of per-attribute or per-keyword scores. the key point of ta is that it aggregates scores on the fly, thus computes a lower bound for the total score of the current rank-k result record (document) and an upper bound for the total scores of all other candidate records (documents), and is thus often able to terminate the index scans long before it reaches the bottom of the index lists, namely, when the lower bound for the rank-k result, the threshold, is at least as high as the upper bound for all other candidates influence:2 type:2 pair index:766 citer id:665 citer title:io-top-k: index-access optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled scheduling methods based on a knapsack-related optimization for sequential accesses and a cost model for random accesses.
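the surrounding text above summarizes the threshold algorithm (ta) family: scan the per-condition index lists in descending score order, aggregate scores on the fly, and stop once the rank-k lower bound reaches the upper bound for all unseen candidates. a compact python rendering of basic ta with summed scores and simulated random accesses follows; access costs and the scheduling questions studied by the citing paper are deliberately left out.

import heapq

def threshold_algorithm(index_lists, k):
    """index_lists: dict condition -> list of (docid, score) sorted by
    descending score; random access is simulated via a dict lookup.
    returns the k docids with the highest summed score."""
    random_access = {c: dict(lst) for c, lst in index_lists.items()}
    last_seen = {c: lst[0][1] for c, lst in index_lists.items()}
    totals = {}
    depth = 0
    while depth < max(len(lst) for lst in index_lists.values()):
        for cond, lst in index_lists.items():        # round-robin sorted access
            if depth >= len(lst):
                continue
            doc, score = lst[depth]
            last_seen[cond] = score
            if doc not in totals:
                # random accesses resolve the doc's scores in all other lists
                totals[doc] = sum(random_access[c].get(doc, 0.0)
                                  for c in index_lists)
        depth += 1
        top = heapq.nlargest(k, totals.values())
        threshold = sum(last_seen.values())          # upper bound for unseen docs
        if len(top) == k and top[-1] >= threshold:   # lower bound meets upper bound
            break
    return heapq.nlargest(k, totals, key=totals.get)

lists = {
    "title": [("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
    "body":  [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)],
}
print(threshold_algorithm(lists, k=1))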
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:44 citee title:filtered document retrieval with frequency-sorted indexes citee abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed surrounding text:introduction 1. 1 motivation top-k query processing is a key building block for data discovery and ranking and has been intensively studied in the context of information retrieval [6, 21, ***]<2>, multimedia similarity search [10, 11, 12, 22]<2>, text and data integration [15, 18]<2>, business analytics [1]<2>, preference queries over product catalogs and internet-based recommendation sources [3, 22]<2>, distributed aggregation of network logs and sensor data [7]<2>, and many other important application areas. such queries evaluate search conditions over multiple attributes or text keywords, assign a numeric score that reflects the similarity or relevance of a candidate record or document for each condition, then combine these scores by a monotonic aggregation function such as weighted summation, and finally return the top-k results that have the highest total scores influence:2 type:2,3 pair index:767 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. 
the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:219 citee title:an efficient and versatile query engine for topx search citee abstract:: this paper presents a novel engine, coined topx, for efficient ranked retrieval of xml documents over semistructured but nonschematic data collectionsfi the algorithm follows the paradigm of threshold algorithms for top-k query processing with a focus on inexpensive sequential accesses to index lists and only a few judiciously scheduled random accessesfi the difficulties in applying surrounding text:probabilistic cost estimation for top-k queries has been a side issue in the recent work of [30]<2>, but there is no consideration of scheduling issues. our topx work on xml ir [***]<3> included specific scheduling aspects for resolving structural path conditions, but did not consider the more general problem of integrated scheduling for sas and ras. the ranksql work [16, 20]<2> considers the order of binary rank joins at query-planning time influence:3 type:3 pair index:768 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. 
in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:676 citee title:top-k query evaluation with probabilistic guarantees citee abstract:top-k queries based on ranking elements of multidimensional datasets are a fundamental building block for many kinds of information discovery. the best known general-purpose algorithm for evaluating top-k queries is fagin's threshold algorithm (ta). since the user's goal behind top-k queries is to identify one or a few relevant and novel data items, it is intriguing to use approximate variants of ta to reduce run-time costs. this paper introduces a family of approximate top-k algorithms based on probabilistic arguments. when scanning index lists of the underlying multidimensional data space in descending order of local scores, various forms of convolution and derived bounds are employed to predict when it is safe, with high probability, to drop candidate items and to prune the index scans. the precision and the efficiency of the developed methods are experimentally evaluated based on a large web corpus and a structured data collection. 1. introduction 1. 1 motivation top-k queries on multidimensional datasets compute the k most relevant or interesting results to a partial-match query, based on similarity scores of attribute values with regard to elementary query conditions and a score aggregation function such as weighted summation. this fundamental building block for information discovery arises in many important application classes such as 1) web and intranet search engines with scores based on word-occurrence statistics and possibly combining criteria like text-based relevance, link-based authority, and recency, 2) multimedia similarity search on feature vectors of images, music, or video, or 3) preference queries over structured and semistructured data such as product catalogs or customer support data (the latter having a major text component as well). the best known general-purpose method for top-k queries is fagin's threshold algorithm, also known as ta, which has been independently proposed also by nepal et al. and güntzer et al. this method assumes that each attribute of the multidimensional data space has an index list by which one can access the data items in descending order of the "local" score for the surrounding text:in the second phase, it makes ra for the missing dimensions of candidates that are chosen similarly to upper. our own recent work [***]<2> has used histograms and dynamic convolutions on score distributions to predict the total score of top-k candidates for more aggressive pruning. the scheduling in that work is standard round-robin, however. conclusions this paper presents a comprehensive algorithmic framework and extensive experimentation for various data collections and system setups to address the problem of index access scheduling in top-k query processing. unlike more aggressive pruning strategies proposed in the literature [19, 24, ***]<2> that provide approximate top-k results, the methods we presented here are non-approximative and achieve major runtime gains of factors up to 5 over existing state-of-the-art approaches with no loss in result precision.
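the surrounding text above mentions predicting a candidate's total score by convolving histograms of the score distributions of the lists it has not yet been seen in, and dropping candidates whose chance of beating the current threshold is too small. the sketch below shows just that prediction step with coarse, invented histograms; the cited work embeds it in a complete top-k engine with probabilistic guarantees.

def convolve(h1, h2):
    """convolve two bucketed score distributions.
    each histogram maps bucket index -> probability; bucket i stands
    for a score contribution of roughly i * bucket_width."""
    out = {}
    for i, p in h1.items():
        for j, q in h2.items():
            out[i + j] = out.get(i + j, 0.0) + p * q
    return out

def prob_exceeds(known_score, remaining_hists, threshold, bucket_width=0.1):
    """probability that known_score plus the (independent) unknown
    contributions exceeds the threshold."""
    dist = {0: 1.0}
    for h in remaining_hists:
        dist = convolve(dist, h)
    need = threshold - known_score
    return sum(p for b, p in dist.items() if b * bucket_width > need)

# candidate already has score 0.8; two lists still unread, each summarized
# by a 3-bucket histogram of the scores remaining below the scan position
h_title = {0: 0.7, 1: 0.2, 2: 0.1}   # mostly near-zero contributions left
h_body  = {0: 0.5, 2: 0.3, 4: 0.2}
p = prob_exceeds(0.8, [h_title, h_body], threshold=1.2)
print(p)   # drop the candidate if p falls below some epsilon, e.g. 0.1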
moreover, we show that already the simpler methods of our framework, coined the last strategies, provide the largest contribution to this improvement, and the probabilistic extensions get very close to a lower bound for the optimum cost influence:1 type:2 pair index:769 citer id:665 citer title:io-top-k: index-acces optimized top-k query processing citer abstract:top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. top-k queries operate on index lists for a querys elementary conditions and aggregate scores for result candidates. one of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. this procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. this entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. the prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. the current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. our main contributions are new, principled, scheduling methods based on a knapsackrelated optimization for sequential accesses and a cost model for random accesses. the methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. in performance experiments with three different datasets (trec terabyte, http server logs, and imdb), our methods achieved significant performance gains compared to the best previously known methods citee id:677 citee title:processing of distributed top-k queries citee abstract:ranking-aware queries, or top-k queries, have received much attention recently in various contexts such as web, multimedia retrieval, relational databases, and distributed systems. top-k queries play a critical role in many decision-making related activities such as, identifying interesting objects, network monitoring, load balancing, etc. in this paper, we study the ranking aggregation problem in distributed systems. prior research addressing this problem did not take data distributions into account, simply assuming the uniform data distribution among nodes, which is not realistic for real data sets and is, in general, inefficient. in this paper, we propose three efficient algorithms that consider data distributions in different ways. our extensive experiments demonstrate the advantages of our approaches in terms of bandwidth consumption surrounding text:the scheduling in that work is standard round-robin, however. probabilistic cost estimation for top-k queries has been a side issue in the recent work of [***]<2>, but there is no consideration of scheduling issues. 
our topx work on xml ir [28]<3> included specific scheduling aspects for resolving structural path conditions, but did not consider the more general problem of integrated scheduling for sas and ras influence:1 type:2 pair index:770 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked topfi queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce topfi answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:87 citee title:a framework for expressing and combining preferences citee abstract:the advent of the world wide web has created an explosion in the available on-line information. as the range of potential choices expand, the time and effort required to sort through them also expands. we propose a formal framework for expressing and combining user preferences to address this problem. preferences can be used to focus search queries and to order the search results. a preference is expressed by the user for an entity which is described by a set of named fields; each field can take on values from a certain type. the * symbol may be used to match any element of that type. a set of preferences can be combined using a generic combine operator which is instantiated with a value function, thus providing a great deal of flexibility. same preferences can be combined in more than one way and a combination of preferences yields another preference thus providing the closure property. we demonstrate the power of our framework by illustrating how a currently popular personalization system and a real-life application can be realized as special cases of our framework. we also discuss implementation of the framework in a relational setting. surrounding text:a text search engine orders documents by their relevance to query terms. an e-commerce service may sort their products according to a users preference criteria [***]<3> to facilitate purchase decisions. 
for these applications, ... (e. g. , as in [***]<3>). as these concepts are inherently imprecise and user (or application) specific, a practical system should support ad-hoc criteria to be specifically defined (by users or application programmers) influence:3 type:3 pair index:771 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:523 citee title:predicate migration: optimizing queries with expensive predicates citee abstract:the traditional focus of relational query optimization schemes has been on the choice of join methods and join orders. restrictions have typically been handled in query optimizers by "predicate pushdown" rules, which apply restrictions in some random order before as many joins as possible. these rules work under the assumption that restriction is essentially a zero-time operation. however, today's extensible and object-oriented database systems allow users to define time-consuming functions, which may be used in a query's restriction and join predicates. furthermore, sql has long supported subquery predicates, which may be arbitrarily time-consuming to check. thus restrictions should not be considered zero-time operations, and the model of query optimization must be enhanced. in this paper we develop a theory for moving expensive predicates in a query plan so that the total cost of the plan--including the costs of both joins and restrictions--is minimal. we present an algorithm to implement the theory, as well as results of our implementation in postgres.
our experience with the newly enhanced postgres are orders of magnitude faster than plans generated by a traditional query optimizer. the additional complexity of considering expensive predicates during optimization is found to be manageably small. surrounding text:ffi is a user-defined function given at query time, we must invoke the function to evaluate the score for each object. we note that, for boolean queries, similar expensive predicates have been extensively studied in the context of extensible databases [***, 3]<2>. in fact, major dbmss (e. g. , [***, 3]<2>) address processing expensive predicates efficiently. as section 1 discussed, all current major dbmss (e influence:1 type:2 pair index:772 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked topfi queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce topfi answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:525 citee title:optimizing disjunctive queries with expensive predicates citee abstract:in this work, we propose and assess a technique called bypassprocessing for optimizing the evaluation of disjunctivequeries with expensive predicatesfi the technique is particularlyuseful for optimizing selection predicates that containterms whose evaluation costs vary tremendously; efigfi,the evaluation of a nested subquery or the invocation of auser-defined function in an object-oriented or extended relationalmodel may be orders of magnitude more expensivethan an attribute access (and surrounding text:ffi is a user-defined function given at query time, we must invoke the function to evaluate the score for each object. we note that, for boolean queries, similar expensive predicates have been extensively studied in the context of extensible databases [2, ***]<2>. in fact, major dbmss (e. g. , [2, ***]<2>) address processing expensive predicates efficiently. 
as section 1 discussed, all current major dbmss (e influence:1 type:2 pair index:773 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked topfi queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce topfi answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:519 citee title:optimizing queries over multimedia repositories citee abstract:repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. a selection on these attributes will typically produce not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, indicating how well the object matches the selection condition (ranking). also, multimedia repositories may allow access to the attributes of each object only through indexes. we investigate how to optimize the processing of queries over multimedia repositories. a key issue is the choice of the indexes used to search the repository. we define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. although the general problem of picking an optimal plan in the search-minimal execution space is np-hard, we solve the problem efficiently when the predicates in the query are independent. we also show that the problem of optimizing queries that ask for a few top-ranked objects can be viewed, in many cases, as that of evaluating selection conditions. thus, both problems can be viewed together as an extended filtering problem. surrounding text:these predicates, be they user defined or externally accessed, can be arbitrarily expensive to probe, potentially requiring complex computation or access to networked sources. 
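the predicate migration and bypass processing abstracts above argue that expensive selection predicates must be placed by weighing their cost against their selectivity rather than being pushed down blindly. as general background, the sketch below orders a chain of expensive filters by the commonly used rank metric (selectivity - 1) / cost-per-tuple and estimates the resulting cost; the cited papers additionally interleave predicates with joins, which this toy does not attempt, and the example predicates are hypothetical.

def order_predicates(preds):
    """preds: list of (name, selectivity, cost_per_tuple).
    order a conjunctive chain of expensive filters by the rank metric
    (selectivity - 1) / cost: cheap, highly selective filters run first."""
    return sorted(preds, key=lambda p: (p[1] - 1.0) / p[2])

def expected_cost(preds, n_tuples):
    """expected total filtering cost when applying the chain in order."""
    cost, survivors = 0.0, float(n_tuples)
    for _, sel, c in preds:
        cost += survivors * c
        survivors *= sel
    return cost

# hypothetical predicates: an expensive image match and a cheap range test
preds = [("image_match", 0.05, 50.0), ("price_range", 0.5, 0.01)]
ordered = order_predicates(preds)
print([p[0] for p in ordered], expected_cost(ordered, 10_000))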
note that it may appear straightforward to transform an expensive predicate into a normal one: by probing every object for its score, one can build a search index for the predicate (to access objects scored above a threshold or in the sorted order), as required by the current processing frameworks [***, 5, 6, 7, 8, 9, 10]<2> (section 3. 3). [14, 15]<2> then present optimization techniques for exploiting stop after, which limits the result cardinalities of queries. in this relational context, references [***, 5]<2> study more general ranked queries using scoring functions. in particular, [***]<2> exploits indices for query search, and [5]<2> maps ranked queries into boolean range queries. in this relational context, references [***, 5]<2> study more general ranked queries using scoring functions. in particular, [***]<2> exploits indices for query search, and [5]<2> maps ranked queries into boolean range queries. recently, prefer [6]<2> uses materialized views to evaluate preference queries defined as linear sums of attribute values influence:2 type:2 pair index:774 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked topfi queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce topfi answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. 
our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:520 citee title:evaluating top-k selection queries citee abstract:in many applications, users specify target values for certain attributes, without requiring exact matches to these values in return. instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. in this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that traditional relational dbmss can process efficiently. in particular, we study how to determine a range query to surrounding text:these predicates, be they user defined or externally accessed, can be arbitrarily expensive to probe, potentially requiring complex computation or access to networked sources. note that it may appear straightforward to transform an expensive predicate into a normal one: by probing every object for its score, one can build a search index for the predicate (to access objects scored above a threshold or in the sorted order), as required by the current processing frameworks [4, ***, 6, 7, 8, 9, 10]<2> (section 3. 3). [14, 15]<2> then present optimization techniques for exploiting stop after, which limits the result cardinalities of queries. in this relational context, references [4, ***]<2> study more general ranked queries using scoring functions. in particular, [4]<2> exploits indices for query search, and [***]<2> maps ranked queries into boolean range queries. recently, prefer [6]<2> uses materialized views to evaluate preference queries defined as linear sums of attribute values influence:2 type:2 pair index:775 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost.
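the abstract quoted above studies answering a top-k query by translating it into a single range query that a conventional dbms can evaluate, restarting with a wider range if too few tuples qualify. a bare-bones version of that translate-and-restart loop is sketched below; the cited paper's contribution lies in choosing the initial range from statistics, which is replaced here by a crude doubling heuristic.

def range_query(table, target, radius):
    """simulate the dbms range selection: all rows whose attribute value
    lies within radius of the target value."""
    return [row for row in table if abs(row - target) <= radius]

def top_k_via_range(table, target, k, radius=1.0):
    """translate a top-k-closest query into a range query; if fewer
    than k rows qualify, widen the range and restart."""
    while True:
        hits = range_query(table, target, radius)
        if len(hits) >= k or radius > max(abs(r - target) for r in table):
            # enough rows (or the whole table is covered): rank and cut off
            return sorted(hits, key=lambda r: abs(r - target))[:k]
        radius *= 2.0   # restart with a wider range

data = [3.0, 7.5, 8.1, 12.0, 40.0]
print(top_k_via_range(data, target=8.0, k=3))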
further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:187 citee title:a system for the efficient execution of multi-parametric ranked queries citee abstract:users often need to optimize the selection of objects by appropriately weighting the importance of multiple object attributes. such optimization problems appear often in operations research and applied mathematics as well as everyday life; e.g., a buyer may select a home as a weighted function of a number of attributes like its distance from office, its price, its area, etc. we capture such queries in our definition of preference queries that use a weight function over a relation's attributes to derive a score for each tuple. database systems cannot efficiently produce the top results of a preference query because they need to evaluate the weight function over all tuples of the relation. prefer answers preference queries efficiently by using materialized views that have been preprocessed and stored. we first show how the result of a preference query can be produced in a pipelined fashion using a materialized view. then we show that excellent performance can be delivered given a reasonable number of materialized views and we provide an algorithm that selects a number of views to precompute and materialize given space constraints. we have implemented the algorithms proposed in this paper in a prototype system called prefer, which operates on top of a commercial database management system. we present the results of a performance comparison, comparing our algorithms with prior approaches using synthetic datasets. our results indicate that the proposed algorithms are superior in performance compared to other approaches, both in preprocessing (preparation of materialized views) as well as execution time. surrounding text:these predicates, be they user defined or externally accessed, can be arbitrarily expensive to probe, potentially requiring complex computation or access to networked sources. note that it may appear straightforward to transform an expensive predicate into a normal one: by probing every object for its score, one can build a search index for the predicate (to access objects scored above a threshold or in the sorted order), as required by the current processing frameworks [4, 5, ***, 7, 8, 9, 10]<2> (section 3.3). in particular, [4]<2> exploits indices for query search, and [5]<2> maps ranked queries into boolean range queries. recently, prefer [***]<2> uses materialized views to evaluate preference queries defined as linear sums of attribute values. these works assume that scoring functions directly combine attributes (which are essentially search predicates) influence:2 type:2 pair index:776 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions.
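Since the PREFER citee above evaluates linear preference queries from materialized views, a rough illustration of the underlying idea may help. The sketch below is not PREFER's actual watermark computation; it scans a view sorted by one weight vector and stops early using the simpler bound f_query(t) <= max_i(q_i/v_i) * f_view(t), which is only valid under the stated assumptions (all attribute values and query weights nonnegative, all view weights strictly positive).

def topk_from_view(view_sorted_desc, view_weights, query_weights, k):
    # view_sorted_desc: attribute vectors sorted by f_view descending;
    # assumes nonnegative attributes/query weights and strictly positive view weights
    f = lambda w, t: sum(wi * ti for wi, ti in zip(w, t))
    ratio = max(q / v for q, v in zip(query_weights, view_weights))
    best = []                       # (f_query score, tuple), kept sorted ascending
    for t in view_sorted_desc:
        # any unseen tuple t2 has f_view(t2) <= f_view(t), hence f_query(t2) <= ratio * f_view(t):
        # once that bound cannot beat the current k-th best score, scanning can stop
        if len(best) == k and ratio * f(view_weights, t) <= best[0][0]:
            break
        best.append((f(query_weights, t), t))
        best.sort()
        best = best[-k:]
    return list(reversed(best))     # top-k under the query weights, best first

The design point mirrors the citee's: because the view is already ordered, the answer can be produced in a pipelined fashion without scoring every tuple of the relation.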
second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:330 citee title:combining fuzzy information from multiple systems citee abstract:in a traditional database system, the result of a query is a set of values (those values that satisfy the query). in other data servers, such as a system with queries based on image content, or many text retrieval systems, the result of a query is a sorted list. for example, in the case of a system with queries based on image content, the query might ask for objects that are a particular shade of red, and the result of the query would be a sorted list of objects in the database, sorted by how ... surrounding text:these predicates, be they user defined or externally accessed, can be arbitrarily expensive to probe, potentially requiring complex computation or access to networked sources. note that it may appear straightforward to transform an expensive predicate into a normal one: by probing every object for its score, one can build a search index for the predicate (to access objects scored above a threshold or in the sorted order), as required by the current processing frameworks [4, 5, 6, ***, 8, 9, 10]<2> (section 3.3). top-k queries have been developed recently in two different contexts. first, in a middleware environment, fagin [***, 8]<2> pioneered ranked queries and established the well-known . influence:2 type:2 pair index:777 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations.
these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:723 citee title:using fagin's algorithm for merging ranked results in multimedia middleware citee abstract:a distributed multimedia information system allows applications to access a variety of data, of different modalities, stored in data sources with their own specialized search capabilities. in such a system, the user can request that a set of objects be ranked by a particular property, or by a combination of properties. in , fagin gives an algorithm for efficiently merging multiple ordered streams of ranked results, to form a new stream ordered by a combination of those ranks surrounding text:these predicates, be they user defined or externally accessed, can be arbitrarily expensive to probe, potentially requiring complex computation or access to networked sources. note that it may appear straightforward to transform an expensive predicate into a normal one: by probing every object for its score, one can build a search index for the predicate (to access objects scored above a threshold or in the sorted order), as required by the current processing frameworks [4, 5, 6, 7, ***, 9, 10]<2> (section 3.3). top-k queries have been developed recently in two different contexts. first, in a middleware environment, fagin [7, ***]<2> pioneered ranked queries and established the well-known . influence:2 type:2 pair index:778 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers.
to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes. for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm (fagin's algorithm, or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm (the threshold algorithm, or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. surrounding text:these predicates, be they user defined or externally accessed, can be arbitrarily expensive to probe, potentially requiring complex computation or access to networked sources. note that it may appear straightforward to transform an expensive predicate into a normal one: by probing every object for its score, one can build a search index for the predicate (to access objects scored above a threshold or in the sorted order), as required by the current processing frameworks [4, 5, 6, 7, 8, 9, ***]<2> (section 3.3) influence:2 type:2 pair index:779 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates.
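The TA description in the citee abstract above is compact enough to restate as code. The sketch below is a hedged, simplified reading of the threshold algorithm, not an actual middleware API: each list is assumed to support sorted access via an iterator and random access via a lookup function, min() stands in for an arbitrary monotone aggregation, and the lists are assumed long enough that sorted access does not run dry before the stop condition fires.

import heapq

def threshold_algorithm(sorted_lists, random_access, k, agg=min):
    # sorted_lists: iterators yielding (object_id, grade) in descending grade order
    # random_access: function (object_id, list_index) -> grade
    m = len(sorted_lists)
    last_seen = [None] * m        # last grade seen under sorted access, per list
    topk = []                     # min-heap of (aggregate_score, object_id)
    seen = set()
    while True:
        for i, lst in enumerate(sorted_lists):
            obj, grade = next(lst)               # one round of sorted access per list
            last_seen[i] = grade
            if obj not in seen:
                seen.add(obj)
                # complete the newly seen object by random access to the other lists
                grades = [grade if j == i else random_access(obj, j) for j in range(m)]
                score = agg(grades)
                if len(topk) < k:
                    heapq.heappush(topk, (score, obj))
                else:
                    heapq.heappushpop(topk, (score, obj))
        threshold = agg(last_seen)               # best score any unseen object could still reach
        if len(topk) == k and topk[0][0] >= threshold:
            return sorted(topk, reverse=True)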
as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:667 citee title:minimal probing: supporting expensive predicates for top-k queries citee abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing surrounding text:in addition, mpro can be immediately generalized for approximate processing.
since approximate answers are often acceptable in ranked queries (which are inherently imprecise), we extend mpro to enable trading efficiency with accuracy, which we report in [***]<1>. note that this paper concentrates on the algorithmic framework for supporting expensive predicates, and not on other related issues; we leave further details (e.g., proof, approximation, and more experiments) to an extended report [***]<1>. e.g., parallelization (section 5.3) and approximation [***]<1>, as well as analytical study of probe scalability (section 5.4). the same results hold when objects have their own schedules. while such scheduling is np-hard in general [***]<1>, section 4.3 will discuss an online sampling-based scheduler that effectively (almost always) finds the best schedules by greedily scheduling more cost-effective predicates. for any object in the top-k slots, its next probe is necessary. theorem 1 formally states this result (see [***]<1> for the proof). theorem 1 (necessary-probe principle): consider a ranked query with scoring function f and retrieval size k. putting these together, we immediately conclude that an algorithm will be probe-optimal (section 3.2) if it only performs necessary probes, which we formally state in lemma 1 (see [***]<1> for a proof). our goal is thus to design such an algorithm. on the other hand, when the stop condition holds, it follows from theorem 1 that no more probes can be necessary, and thus the top-k answers must have fully surfaced, which is indeed the case. (we discuss in [***]<1> how the stop condition can be customized for approximate queries.) figure 4 illustrates algorithm mpro for our example of finding the top-2 objects. our framework chooses to focus on global scheduling. note that scheduling can be expensive (it is np-hard in the number of predicates, as we show in [***]<1>). as we will see, our approach is essentially global, predicate-by-predicate scheduling, using sampling to acquire predicate selectivities (and costs) for constructing a global schedule online at query startup. thus, the probe cost of mpro is the total cost of the necessary probes summed over all objects, and our goal is to find a schedule that minimizes this cost. however, as our extended report [***]<1> shows, this optimal scheduling problem is np-hard and thus generally intractable. since an exhaustive method may be too expensive and impractical, we propose a greedy algorithm that always picks the most cost-effective predicate with a low aggregate selectivity (thus high filtering rate) and a low cost. section 5.4 analytically develops the scalability of mpro, showing that its cost growth is sub-linear in database size. (in addition, in [***]<1> we also extend mpro for approximate queries.) we thus obtain an interesting result: if a database is uniformly scaled up by some factor, mpro can retrieve proportionally more top answers with proportionally more probe cost. while we leave a formal proof to [***]<1>, section 6 experimentally verifies this result over datasets that are only similar but not identical. theorem 3 (probe scalability): consider a ranked query and a schedule. our algorithm is thus provably optimal, based on the necessary-probe principle. further, we show that mpro can scale well and can be easily parallelized (and it supports approximation [***]<1>).
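The necessary-probe principle quoted above has a compact operational reading: only the object currently occupying a top-k slot by its upper-bound score can require another probe. The sketch below is an illustrative reconstruction of that idea, not the authors' implementation; the single cheap search predicate, the fixed probe schedule, grades in [0, 1], and min() as the monotone scoring function are all assumptions of the sketch.

import heapq

def mpro_sketch(objects, search_score, probe_predicates, k, combine=min, max_grade=1.0):
    # objects: iterable of ids; search_score: id -> grade of the cheap search predicate;
    # probe_predicates: ordered schedule of expensive predicates, each id -> grade
    heap = []
    for obj in objects:
        known = [search_score(obj)]
        # "ceiling" score: assume the maximal grade for every predicate not yet probed
        ceiling = combine(known + [max_grade] * len(probe_predicates))
        heapq.heappush(heap, (-ceiling, obj, known))
    results = []
    while heap and len(results) < k:
        neg_ceiling, obj, known = heapq.heappop(heap)
        probed = len(known) - 1
        if probed == len(probe_predicates):
            results.append((-neg_ceiling, obj))          # fully probed: its score is final
        else:
            # the next probe of the object at the top of the queue is the necessary probe
            known = known + [probe_predicates[probed](obj)]
            ceiling = combine(known + [max_grade] * (len(probe_predicates) - probed - 1))
            heapq.heappush(heap, (-ceiling, obj, known))
    return results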
we have implemented the mechanism described in this paper, based on which we performed extensive experiments with both real-life databases and synthetic datasets influence:1 type:2 pair index:780 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:724 citee title:supporting incremental join queries on ranked inputs citee abstract:this paper investigates the problem of incremental joins of multiple ranked data sets when the join condition is a list of arbitrary user-defined predicates on the input tuples. this problem arises in many important applications dealing with ordered inputs and multiple ranked data sets, and requiring the top k solutions. we use multimedia applications as the motivating examples but the problem is equally applicable to traditional database applications involving optimal resource surrounding text:). [***]<2> generalizes to handling arbitrary  -joins as combining constraints. as section 3 discusses, these works assume sorted access of search predicates. we study general fuzzy joins as arbitrary probe predicates. more recently, around the same time as our work [11]<2>, some related efforts emerged, addressing the problems of  -joins [***]<2> (which are boolean and not fuzzy, as just explained) and external sources [17]<2>. in contrast, we address a more general and unified problem of expensive predicates. in particular, algorithm mpro can be cast as a specialization of  . several other works [17, ***, 16]<2> also adopt the same basis. our work distinguishes itself in several aspects: first, we aim at a different or more general problem (as contrasted above) influence:3 type:2 pair index:781 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates.
as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:725 citee title:on saying enough already! in sql citee abstract:in this paper, we study a simple sql extension that enables query writers to explicitly limit the cardinality of a query result. we examine its impact on the query optimization and run-time execution components of a relational dbms, presenting two approaches, a conservative approach and an aggressive approach, to exploiting cardinality limits in relational query plans. results obtained from an empirical study conducted using db2 demonstrate the benefits of the sql extension and illustrate the tradeoffs between our two approaches to implementing it. surrounding text:carey et al. [***, 15]<2> then present optimization techniques for exploiting stop after, which limits the result cardinalities of queries. in this relational context, references [4, 5]<2> study more general ranked queries using scoring functions influence:3 type:2 pair index:782 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate.
the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:250 citee title:reducing the braking distance of an sql query engine citee abstract:in a recent paper, we proposed adding a stop after clause to sql to permit the cardinality of a query result to be explicitly limited by query writers and query tools. we demonstrated the usefulness of having this clause, showed how to extend a traditional cost-based query optimizer to accommodate it, and demonstrated via db2-based simulations that large performance gains are possible when stop after queries are explicitly supported by the database engine. in this paper, we present several new surrounding text:carey et al. [14, ***]<2> then present optimization techniques for exploiting stop after, which limits the result cardinalities of queries. in this relational context, references [4, 5]<2> study more general ranked queries using scoring functions influence:3 type:2 pair index:783 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized.
our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:251 citee title:integration of heterogeneous databases without common domains using queries based on textual similarity citee abstract:most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. however, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, surrounding text:this paper identifies and formulates the general problem of supporting expensive predicates for ranked queries, providing unified abstraction for user-defined functions, external predicates, and fuzzy joins. [***]<2> develops an ir-based similarity-join. we study general fuzzy joins as arbitrary probe predicates. in particular, algorithm mpro can be cast as a specialization of  . several other works [17, 12, ***]<2> also adopt the same basis. our work distinguishes itself in several aspects: first, we aim at a different or more general problem (as contrasted above) influence:3 type:2 pair index:784 citer id:667 citer title:minimal probing: supporting expensive predicates for top-k queries citer abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing citee id:248 citee title:evaluating top-k queries over web-accessible databases citee abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query.
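The similarity-join citee above is one concrete instance of the fuzzy joins the citer classifies as expensive probe predicates: the join condition is a user-defined text-similarity score that must be evaluated per candidate pair. The toy sketch below uses Jaccard similarity over whitespace tokens purely as a stand-in for the citee's actual measure, to illustrate why such a predicate cannot be answered from a precomputed index.

def jaccard(a, b):
    # illustrative text-similarity predicate (not the citee's measure)
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def topk_fuzzy_join(left_names, right_names, k):
    # every candidate pair requires its own probe of the similarity predicate
    scored_pairs = ((jaccard(l, r), l, r) for l in left_names for r in right_names)
    return sorted(scored_pairs, reverse=True)[:k]

# e.g. topk_fuzzy_join(["intl business machines"], ["international business machines corp"], 1)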
this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. surrounding text:we study general fuzzy joins as arbitrary probe predicates. more recently, around the same time as our work [11]<2>, some related efforts emerged, addressing the problems of  -joins [12]<2> (which are boolean and not fuzzy, as just explained) and external sources [***]<2>. in contrast, we address a more general and unified problem of expensive predicates. in particular, algorithm mpro can be cast as a specialization of  . several other works [***, 12, 16]<2> also adopt the same basis. our work distinguishes itself in several aspects: first, we aim at a different or more general problem (as contrasted above) influence:2 type:2 pair index:785 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:247 citee title:top-k selection queries over relational databases: mapping strategies and performance evaluation citee abstract:in many applications, users specify target values for certain attributes, without requiring exact matches to these values in return.
instead, the result to such queries is typically a rank of the "top k" tuples that best match the given attribute values. in this paper, we study the advantages and limitations of processing a top-k query by translating it into a single range query that a traditional relational database management system (rdbms) can process efficiently. in particular, we study how to determine a range query to evaluate a top-k query by exploiting the statistics available to an rdbms, and the impact of the quality of these statistics on the retrieval efficiency of the resulting scheme. we also report the first experimental evaluation of the mapping strategies over a real rdbms, namely over microsoft's sql server 7.0. the experiments show that our new techniques are robust and significantly more efficient than previously known strategies requiring at least one sequential scan of the data sets. surrounding text:the top-k join queries are also discussed brie y in [5]<2> as a possible extension to their algorithm to evaluate top-k selection queries. top-k selection queries over relational databases can be mapped into range queries using high dimensional histograms [***]<2>. in [13]<2>, top-k selection queries are evaluated in relational query processors by introducing a new pipelined join operator termed nra-rj influence:3 type:2 pair index:786 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a userspecified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rankjoin algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:248 citee title:evaluating top-k queries over web-accessible databases citee abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the users location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating. processing such top-k queries efficiently is challenging for a number of reasons. 
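The mapping strategy described in the citee abstract above, translating a top-k selection query into a single range query and then ranking (and possibly restarting) on the retrieved tuples, can be sketched as follows. estimate_radius() and run_range_query() are placeholders for the histogram-based estimation and the SQL execution layer; the squared-distance ranking and the doubling restart policy are illustrative assumptions, not the citee's exact strategies.

def topk_via_range_query(targets, k, estimate_radius, run_range_query, max_tries=5):
    # targets: {attribute: target value}; run_range_query takes {attribute: (low, high)}
    dist = lambda row: sum((row[a] - t) ** 2 for a, t in targets.items())
    radius = estimate_radius(targets, k)   # e.g. chosen from multidimensional histograms
    tuples = []
    for _ in range(max_tries):
        # box query: attribute BETWEEN target - radius AND target + radius, per attribute
        tuples = run_range_query({a: (t - radius, t + radius) for a, t in targets.items()})
        if len(tuples) >= k:
            break
        radius *= 2                        # range too selective: enlarge the box and restart
    return sorted(tuples, key=dist)[:k]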
one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process top-k queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. surrounding text:, see [9, 10, 15]<2>). in [***]<2>, the authors introduce an algorithm for evaluating top-k selection queries over web-accessible sources assuming that only random access is available for a subset of the sources. chang and hwang [5]<2> address the expensive probing of some of the object scores in top-k selection queries influence:2 type:2 pair index:787 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:249 citee title:on saying "enough already!" in sql citee abstract:in this paper, we study a simple sql extension that enables query writers to explicitly limit the cardinality of a query result. we examine its impact on the query optimization and run-time execution components of a relational dbms, presenting two approaches, a conservative approach and an aggressive approach, to exploiting cardinality limits in relational query plans. results obtained from an empirical study conducted using db2 demonstrate the benefits of the sql extension and illustrate the tradeoffs between our two approaches to implementing it surrounding text:2 are attributes of these relations. the stop after operator, introduced in [***, 4]<1>, limits the output to the first five tuples. in q1, the only way to produce ranked results on the expression 0.3a.1+0.7b.2 is by using a sort operator on top of the join. one pipelined execution plan for the query q is the left-deep plan, plan a, given in figure 7. we limit the number of reported answers to k by applying the stop-after query operator [***, 4]<1>.
the operator is implemented in the prototype as a physical query operator scan-stop, a straightforward implementation of stop-after, and appears on top of the query plan influence:2 type:1 pair index:788 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:250 citee title:reducing the braking distance of an sql query engine citee abstract:in a recent paper, we proposed adding a stop after clause to sql to permit the cardinality of a query result to be explicitly limited by query writers and query tools. we demonstrated the usefulness of having this clause, showed how to extend a traditional cost-based query optimizer to accommodate it, and demonstrated via db2-based simulations that large performance gains are possible when stop after queries are explicitly supported by the database engine. in this paper, we present several new surrounding text:2 are attributes of these relations. the stop after operator, introduced in [3, ***]<1>, limits the output to the first five tuples. in q1, the only way to produce ranked results on the expression 0.3a.1+0.7b.2 is by using a sort operator on top of the join. one pipelined execution plan for the query q is the left-deep plan, plan a, given in figure 7. we limit the number of reported answers to k by applying the stop-after query operator [3, ***]<1>. the operator is implemented in the prototype as a physical query operator scan-stop, a straightforward implementation of stop-after, and appears on top of the query plan influence:2 type:1 pair index:789 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm.
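The scan-stop operator mentioned in the surrounding text is essentially a cardinality limit that sits on top of a pipelined plan. The sketch below is one plausible reading, not the prototype's code: a generator that stops pulling from its child after k tuples, placed over a naive join-score-sort plan for a query like the 0.3a.1+0.7b.2 example; the relation layout and column names are assumptions.

def scan_stop(input_iter, k):
    # pass through the first k tuples from the child operator, then stop pulling
    for count, tup in enumerate(input_iter):
        if count >= k:
            break
        yield tup

def naive_topk_plan(rel_a, rel_b, k):
    # naive plan: join, score with 0.3*a.1 + 0.7*b.2, fully sort, then limit via scan-stop
    joined = ((a, b) for a in rel_a for b in rel_b if a["key"] == b["key"])
    scored = sorted(joined, key=lambda ab: 0.3 * ab[0]["a1"] + 0.7 * ab[1]["b2"], reverse=True)
    return list(scan_stop(iter(scored), k))

The blocking sort underneath is exactly what the citer's rank-join operators are designed to avoid.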
the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:667 citee title:minimal probing: supporting expensive predicates for top-k queries citee abstract:this paper addresses the problem of evaluating ranked top-k queries with expensive predicates. as major dbmss now all support expensive user-defined predicates for boolean queries, we believe such support for ranked queries will be even more important: first, ranked queries often need to model user-specific concepts of preference, relevance, or similarity, which call for dynamic user-defined functions. second, middleware systems must incorporate external predicates for integrating autonomous sources typically accessible only by per-object queries. third, fuzzy joins are inherently expensive, as they are essentially user-defined operations that dynamically associate multiple relations. these predicates, being dynamically defined or externally accessed, cannot rely on index mechanisms to provide zero-time sorted output, and must instead require per-object probe to evaluate. the current standard sort-merge framework for ranked queries cannot efficiently handle such predicates because it must completely probe all objects, before sorting and merging them to produce top-k answers. to minimize expensive probes, we thus develop the formal principle of necessary probes, which determines if a probe is absolutely required. we then propose algorithm mpro which, by implementing the principle, is provably optimal with minimal probe cost. further, we show that mpro can scale well and can be easily parallelized. our experiments using both a real-estate benchmark database and synthetic datasets show that mpro enables significant probe reduction, which can be orders of magnitude faster than the standard scheme using complete probing surrounding text:in [2]<2>, the authors introduce an algorithm for evaluating top-k selection queries over web-accessible sources assuming that only random access is available for a subset of the sources. chang and hwang [***]<2> address the expensive probing of some of the object scores in top-k selection queries. they assume a sorted access on one of the attributes while other scores are obtained through probing or executing some user-defined function on the remaining attributes. in our experimental study, we compare our proposed join operators with the j and show significant enhancement in the overall performance. the top-k join queries are also discussed briefly in [***]<2> as a possible extension to their algorithm to evaluate top-k selection queries. top-k selection queries over relational databases can be mapped into range queries using high dimensional histograms [1]<2> influence:2 type:2 pair index:790 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved.
in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:832 citee title:rank aggregation methods for the web citee abstract:we consider the problem of combining ranking results from various sources. in the context of the web, the main applications include building meta-search engines, combining ranking functions, selecting documents based on multiple criteria, and improving search precision through word associations. we develop a set of techniques for the rank aggregation problem and compare their performance to that of well-known methods. a primary goal of our work is to design rank aggregation techniques that can surrounding text:1 introduction rank-aware query processing has become a vital need for many applications. in the context of the web, the main applications include building meta-search engines, combining ranking functions and selecting documents based on multiple criteria [***]<1>. efficient rank aggregation is the key to a useful search engine influence:2 type:3 pair index:791 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:330 citee title:combining fuzzy information from multiple systems citee abstract:in a traditional database system, the result of a query is a set of values (those values that satisfy the query). in other data servers, such as a system with queries based on image content, or many text retrieval systems, the result of a query is a sorted list.
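The citer abstract repeated above states the rank-join idea, ranking join results progressively while the join runs, without spelling out the mechanics. The sketch below is a hedged reconstruction in that spirit, not necessarily the paper's operators: it ripples through two ranked inputs, buffers candidate join results, and emits a result only once no unseen combination can beat it, using a corner-based threshold. The round-robin pulling policy, summation as the scoring function, and the equality join key are assumptions of the sketch.

import heapq
from collections import defaultdict
from itertools import count

def rank_join(left, right, k, combine=lambda sl, sr: sl + sr):
    # left, right: lists of (join_key, local_score, payload), each sorted by local_score descending
    seen_l, seen_r = defaultdict(list), defaultdict(list)
    top_l = top_r = last_l = last_r = None   # first (max) and latest (min) scores pulled per input
    buffer, results, tie = [], [], count()   # buffer: max-heap via negated combined scores
    li = ri = 0
    turn = 0
    while len(results) < k and (li < len(left) or ri < len(right)):
        pull_left = (turn == 0 and li < len(left)) or ri >= len(right)
        if pull_left:
            key, score, payload = left[li]; li += 1
            top_l = score if top_l is None else top_l
            last_l = score
            for r_score, r_payload in seen_r[key]:
                heapq.heappush(buffer, (-combine(score, r_score), next(tie), payload, r_payload))
            seen_l[key].append((score, payload))
        else:
            key, score, payload = right[ri]; ri += 1
            top_r = score if top_r is None else top_r
            last_r = score
            for l_score, l_payload in seen_l[key]:
                heapq.heappush(buffer, (-combine(l_score, score), next(tie), l_payload, payload))
            seen_r[key].append((score, payload))
        turn = 1 - turn
        if None not in (top_l, top_r, last_l, last_r):
            # no join result not yet formed can score above this threshold
            threshold = max(combine(top_l, last_r), combine(last_l, top_r))
            while buffer and len(results) < k and -buffer[0][0] >= threshold:
                neg, _, lp, rp = heapq.heappop(buffer)
                results.append((-neg, lp, rp))
    while buffer and len(results) < k:       # inputs exhausted: buffered results are final
        neg, _, lp, rp = heapq.heappop(buffer)
        results.append((-neg, lp, rp))
    return results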
for example, in the case of a system with queries based on image content, the query might ask for objects that are a particular shade of red, and the result of the query would be a sorted list of objects in the database, sorted by how ... surrounding text:the problem is tackled in different contexts. in middleware environments, fagin [***]<2> and fagin et al. [8]<2> introduce the first efficient set of algorithms to answer ranking queries. database objects with m attributes are viewed as m separate lists, each supporting sorted and, possibly, random access to object scores influence:2 type:2 pair index:792 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributes. for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, such as min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm (fagin's algorithm, or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm (the threshold algorithm, or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers.
we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of an object in a list, and obtains it in one step). we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. surrounding text:the problem is tackled in different contexts. in middleware environments, fagin [7]<2> and fagin et al. [***]<2> introduce the first efficient set of algorithms to answer ranking queries. database objects with m attributes are viewed as m separate lists, each of which supports sorted and, possibly, random access to object scores. database objects with m attributes are viewed as m separate lists, each of which supports sorted and, possibly, random access to object scores. the ta algorithm [***]<2> assumes the availability of random access to object scores in any list besides the sorted access to each list. the nra algorithm [***]<2> assumes only sorted access is available to individual lists. the ta algorithm [***]<2> assumes the availability of random access to object scores in any list besides the sorted access to each list. the nra algorithm [***]<2> assumes only sorted access is available to individual lists. similar algorithms are introduced (e. in [13]<2>, top-k selection queries are evaluated in relational query processors by introducing a new pipelined join operator termed nra-rj. nra-rj modifies the nra algorithm [***]<2> to work on ranges of scores instead of requiring the input to have exact scores. nra-rj is an efficient rank-join query operator that joins multiple ranked inputs based on a key-equality condition and cannot handle general join conditions influence:2 type:2 pair index:793 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:465 citee title:optimizing multi-feature queries for image databases citee abstract:in digital libraries image retrieval queries can be based on the similarity of objects, using several feature attributes like shape, texture, color or text surrounding text:e.g., see [***, 10, 15]<2>.
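the ta/nra description above can be made concrete with a simplified python sketch of the sorted-access-only (nra-style) idea: read the ranked lists round-robin, keep lower and upper score bounds per object, and stop once the current top-k cannot be overtaken. this is a hedged illustration that assumes a monotone scoring function (a plain sum), not the exact published algorithm.

```python
def nra_topk(lists, k):
    """Simplified NRA-style top-k using only sorted access.

    `lists` holds m lists of (object_id, score) pairs, each sorted by
    descending score; the aggregate score is assumed to be the sum of the
    per-list scores. Returns the top-k objects with their (lower, upper)
    score bounds at termination.
    """
    m = len(lists)
    pos = [0] * m                 # next unread position per list
    last = [None] * m             # last score read per list (scan depth)
    seen = {}                     # object -> per-list scores, None where unseen

    def bounds(obj):
        scores = seen[obj]
        lower = sum(s for s in scores if s is not None)
        upper = lower + sum(last[i] for i in range(m)
                            if scores[i] is None and last[i] is not None)
        return lower, upper

    while any(pos[i] < len(lists[i]) for i in range(m)):
        for i in range(m):        # one round of sorted accesses
            if pos[i] < len(lists[i]):
                obj, score = lists[i][pos[i]]
                pos[i] += 1
                last[i] = score
                seen.setdefault(obj, [None] * m)[i] = score
        ranked = sorted(seen, key=lambda o: bounds(o)[0], reverse=True)
        if len(ranked) < k:
            continue
        kth_lower = bounds(ranked[k - 1])[0]
        unseen_upper = sum(s for s in last if s is not None)
        # Stop when neither a partially seen nor a completely unseen object
        # can still beat the current k-th lower bound.
        if unseen_upper <= kth_lower and all(bounds(o)[1] <= kth_lower
                                             for o in ranked[k:]):
            break
    ranked = sorted(seen, key=lambda o: bounds(o)[0], reverse=True)[:k]
    return [(o, bounds(o)) for o in ranked]

l1 = [("x", 0.9), ("y", 0.8), ("z", 0.1)]
l2 = [("y", 0.7), ("x", 0.6), ("z", 0.5)]
print(nra_topk([l1, l2], 1))      # 'x' and 'y' tie at total score 1.5
```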
in [2]<2>, the authors introduce an algorithm for evaluating 3 top-k selection queries over web-accessible sources assuming that only random access is available for a subset of the sources influence:3 type:2 pair index:794 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a userspecified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rankjoin algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:669 citee title:towards efficient multi-feature queries in heterogeneous environments citee abstract:applications like multimedia databases or enterprisewideinformation management systems have to meet thechallenge of efficiently retrieving best matching objectsfrom vast collections of datafi we present a new algorithmstream-combine for processing multi-feature queries onheterogeneous data sourcesfi stream-combine is selfadaptingto different data distributions and to the specifickind of the combining functionfi furthermore we present anew retrieval strategy that will essentially speed up surrounding text:g. , see [9, ***, 15]<2>). in [2]<2>, the authors introduce an algorithm for evaluating 3 top-k selection queries over web-accessible sources assuming that only random access is available for a subset of the sources influence:3 type:2 pair index:795 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a userspecified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rankjoin algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. 
the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:837 citee title:ripple joins for online aggregation citee abstract:we present a new family of join algorithms, called ripple joins, for online processing of multi-table aggregation queries in a rela- tional database management system (dbms). such queries arise naturally in interactive exploratory decision-support applications. traditional offline join algorithms are designed to minimize the time to completion of the query. in contrast, ripple joins are designed to minimize the time until an acceptably precise esti- mate of the query result is available, as measured by the length of a confidence interval. ripple joins are adaptive, adjusting their behavior during processing in accordance with the statis- tical properties of the data. ripple joins also permit the user to dynamically trade off the two key performance factors of on- line aggregation: the time between successive updates of the run- ning aggregate, and the amount by which the confidence-interval length decreases at each update. we show how ripple joins can be implemented in an existing dbms using iterators, and we give an overview of the methods used to compute confidence intervals and to adaptively optimize the ripple join aspect-ratio parameters. in experiments with an initial implementation of our algorithms in the postgres dbms, the time required to produce reasonably precise online estimates was up to two orders of magnitude smaller than the time required for the best offline join algorithms to pro- duce exact answers. surrounding text:in [13]<2>, it is shown both analytically and experimentally that nra-rj is superior to j for equality join conditions on key attributes. 3 an overview on ripple join ripple join is a family of join algorithms introduced in [***]<1> in the context of online processing of aggregation queries in a relational dbms. traditional join algorithms are designed to minimize the time till completion. 5. 1 hash rank join operator (hrjn) hrjn can be viewed as a variant of the symmetrical hash join algorithm [12, 19]<1> or the hash ripple join algorithm [***]<1>. the open method is given in table 1. we call this problem the local ranking problem. solving the local ranking problem another version of ripple join is the blocked ripple join [***]<1>. at each step, the algorithm retrieves a new block of one relation, scans all the old tuples of the other relation, and joins each tuple in the new block with the corresponding tuples there influence:1 type:1 pair index:796 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a userspecified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rankjoin algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. 
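the ripple join description above (each step joins one newly retrieved tuple from each input with all previously seen tuples of the other input) is easy to picture with a minimal in-memory python sketch; this shows only the enumeration order, not the cited operator with its adaptive aspect ratios or confidence intervals.

```python
def ripple_join(left, right, predicate):
    """Generate join results in square ripple-join order: at step n the new
    left tuple is matched against the right tuples seen so far and vice
    versa, so results appear progressively instead of after a full scan."""
    seen_left, seen_right = [], []
    for n in range(max(len(left), len(right))):
        if n < len(left):
            l = left[n]
            for r in seen_right:
                if predicate(l, r):
                    yield (l, r)
            seen_left.append(l)
        if n < len(right):
            r = right[n]
            for l in seen_left:
                if predicate(l, r):
                    yield (l, r)
            seen_right.append(r)

# example: equality join on the first field; each matching pair is produced once
L = [(1, "a"), (2, "b"), (3, "c")]
R = [(2, "x"), (3, "y"), (1, "z")]
print(list(ripple_join(L, R, lambda l, r: l[0] == r[0])))
```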
we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:802 citee title:optimization of parallel query execution plans in xprs citee abstract:the authors describe their approach to the optimization of query execution plans in xprs, a multi-user parallel database machine based on a shared-memory multi-processor and a disk array. the main difficulties in this optimization problem are the compile-time unknown parameters such as available buffer size and number of free processors, and the enormous search space of possible parallel plans. the authors deal with these problems with a novel two phase optimization strategy which dramatically reduces the search space and allows run time parameters without significantly compromising plan optimality. they present their two phase strategy and give experimental evidence from xprs benchmarks that indicate that it almost always produces optimal plans surrounding text:in this case, calculating the join condition of a new sampled tuple with previously sampled tuples is very fast (saving i/o). the second variant is exactly the symmetric hash join [***, 19]<1> that allows a high degree of pipelining in parallel databases. when the hash tables grow in size and exceed memory size, the hash ripple join falls back to block ripple join. 5. 1 hash rank join operator (hrjn) hrjn can be viewed as a variant of the symmetrical hash join algorithm [***, 19]<1> or the hash ripple join algorithm [11]<1>. the open method is given in table 1 influence:1 type:1 pair index:797 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a userspecified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rankjoin algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:679 citee title:joining ranked inputs in practice citee abstract:joining ranked inputs is an essential requirement for many database applications, such as ranking search results from multiple search engines and answering multi-feature queries for multimedia retrieval systems. we introduce a new practical pipelined query operator, termed nra-rj, that produces a global rank from input ranked streams based on a score function. 
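the hrjn discussion above combines symmetric hashing with a rank-aware stopping test. the sketch below is a hedged, in-memory approximation of that idea: inputs arrive sorted by descending score, join results go into a priority queue, and a result is reported only once its score reaches the usual threshold bound max(top_left + last_right, last_left + top_right). the sum scoring function and all names are assumptions for illustration, not the operator's actual implementation.

```python
import heapq
from collections import defaultdict

def hash_rank_join(left, right, k):
    """Top-k join of two (join_key, score) inputs sorted by descending score;
    the combined score of a join result is the sum of the two scores."""
    tables = [defaultdict(list), defaultdict(list)]   # symmetric hash tables
    top = [None, None]      # first (maximum) score seen on each input
    last = [None, None]     # most recent score seen on each input
    candidates, out = [], []                          # max-heap via negation
    i = j = 0

    while len(out) < k and (i < len(left) or j < len(right)):
        for side, inp, ptr in ((0, left, i), (1, right, j)):
            if ptr >= len(inp):
                continue
            key, score = inp[ptr]
            if side == 0:
                i += 1
            else:
                j += 1
            if top[side] is None:
                top[side] = score
            last[side] = score
            tables[side][key].append(score)
            for other in tables[1 - side][key]:       # probe the other table
                heapq.heappush(candidates, (-(score + other), key))
        if None in top:
            continue
        # Upper bound on the score of any join result not yet generated.
        threshold = max(top[0] + last[1], last[0] + top[1])
        while candidates and len(out) < k and -candidates[0][0] >= threshold:
            neg, key = heapq.heappop(candidates)
            out.append((key, -neg))
    while candidates and len(out) < k:                # inputs exhausted: drain
        neg, key = heapq.heappop(candidates)
        out.append((key, -neg))
    return out

print(hash_rank_join([("a", 0.9), ("b", 0.8)], [("b", 1.0), ("a", 0.7)], 1))  # [('b', 1.8)]
```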
the output of nra-rj can serve as a valid input to other nra-rj operators in the query pipeline. hence, the nra-rj operator can support... surrounding text:top-k selection queries over relational databases can be mapped into range queries using high dimensional histograms [1]<2>. in [***]<2>, top-k selection queries are evaluated in relational query processors by introducing a new pipelined join operator termed nra-rj. nra-rj modifies the nra algorithm [8]<2> to work on ranges of scores instead of requiring the input to have exact scores. nra-rj is an efficient rank-join query operator that joins multiple ranked inputs based on a key-equality condition and cannot handle general join conditions. in [***]<2>, it is shown both analytically and experimentally that nra-rj is superior to j for equality join conditions on key attributes. 3 an overview on ripple join ripple join is a family of join algorithms introduced in [11]<1> in the context of online processing of aggregation queries in a relational dbms influence:1 type:2 pair index:798 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a user-specified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rank-join algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:724 citee title:supporting incremental join queries on ranked inputs citee abstract:this paper investigates the problem of incremental joins of multiple ranked data sets when the join condition is a list of arbitrary user-defined predicates on the input tuples. this problem arises in many important applications dealing with ordered inputs and multiple ranked data sets, and requiring the top k solutions. we use multimedia applications as the motivating examples but the problem is equally applicable to traditional database applications involving optimal resource surrounding text:they assume a sorted access on one of the attributes while other scores are obtained through probing or executing some user-defined function on the remaining attributes. a more general problem is addressed in [***]<2>. the authors introduce the j algorithm to join multiple ranked inputs to produce a global rank. 7.2 comparing the rank-join operators in this section, we evaluate the performance of the introduced operators by comparing them with each other and with a rank-join operator based on the j algorithm [***]<2>.
we limit our presentation to comparing three rank-join operators: the basic hrjn operator, the hrjn operator and the j operator influence:1 type:2 pair index:799 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a userspecified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rankjoin algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:527 citee title:query processing issues in image (multimedia) databases citee abstract:multimedia databases have attracted academic and industrialinterest, and systems such as qbic (content basedimage retrieval system from ibm) have been releasedfi suchsystems are essential to effectively and efficiently use theexisting large collections of image data in the modern computingenvironmentfi the aim of such systems is to enableretrieval of images based on their contentsfi this problemhas brought together the (decades old) database and imageprocessing communitiesfias part of our surrounding text:g. , see [9, 10, ***]<2>). in [2]<2>, the authors introduce an algorithm for evaluating 3 top-k selection queries over web-accessible sources assuming that only random access is available for a subset of the sources influence:3 type:2 pair index:800 citer id:670 citer title:supporting top-k join queries in relational databases citer abstract:ranking queries produce results that are ordered on some computed score. typically, these queries involve joins, where users are usually interested only in the top-k join results. current relational query processors do not handle ranking queries efficiently, especially when joins are involved. in this paper, we address supporting top-k join queries in relational query processors. we introduce a new rank-join algorithm that makes use of the individual orders of its inputs to produce join results ordered on a userspecified scoring function. the idea is to rank the join results progressively during the join operation. we introduce two physical query operators based on variants of ripple join that implement the rankjoin algorithm. the operators are non-blocking and can be integrated into pipelined execution plans. we address several practical issues and optimization heuristics to integrate the new join operators in practical query processors. we implement the new operators inside a prototype database engine based on predator. 
the experimental evaluation of our approach compares recent algorithms for joining ranked inputs and shows superior performance citee id:852 citee title:xjoin: a reactively- scheduled pipelined join operator citee abstract:wide-area distribution raises significant performance problems for traditional query processing techniques as data access becomes less predictable due to link congestion, load imbalances, and temporary outages. pipelined query execution is a promising approach to coping with unpredictability in such environments as it allows scheduling to adjust to the arrival properties of the data. we have developed a non-blocking join operator, called xjoin, which has a small memory footprint surrounding text:otherwise, the available input is processed. hrjn can be easily adapted to use xjoin [***]<2>. xjoin is a practical adaptive version of the symmetric hash join operator influence:3 type:3 pair index:801 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:471 citee title:vector-space ranking with effective early termination citee abstract:considerable research effort has been invested in improving the effectiveness of information retrieval systems. techniques such as relevance feedback, thesaural expansion, and pivoting all provide better quality responses to queries when tested in standard evaluation frameworks. but such enhancements can add to the cost of evaluating queries. in this paper we consider the pragmatic issue of how to improve the cost-effectiveness of searching. we describe a new inverted file structure using quantized weights that provides superior retrieval effectiveness compared to conventional inverted file structures when early termination heuristics are employed. that is, we are able to reach similar effectiveness levels with less computational cost, and so provide a better cost/performance compromise than previous inverted file organisations. surrounding text:g. , [***, 2, 14, 15, 18, 32]<2> for recent work and [38]<0> for an overview of older work. 
typically, these techniques reorder the inverted lists such that highscoring documents are likely to be at the beginning, and then terminate the search over the inverted lists once most of the high-scoring documents have been scored. some early work is described in [11, 21, 38, 41]<2>. most relevant to our work are the techniques by persin, zobel, and sacks-davis [32]<2> and the follow-up work in [***, 2]<2>, which study effective early termination (pruning) schemes for the cosine measure based on the idea of sorting the postings in the inverted lists by their contribution to the score of the document. a scheme in [32]<2> also proposed to partition the inverted list into several partitions, somewhat reminiscent of our scheme in subsection 5 influence:2 type:2 pair index:802 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:804 citee title:searching the web citee abstract:we offer an overview of current web search engine design. after introducing a generic search enginearchitecture, we examine each engine component in turn. we cover crawling, local web page storage,indexing, and the use of link analysis for boosting search performance. the most common design andimplementation techniques for each of these components are presented. we draw for this presentationfrom the literature, and from our own experimental search engine testbed. emphasis is on introducingthe fundamental concepts, and the results of several performance analyses we conducted to comparedifferent designs. surrounding text:we intend to include these results in a longer journal version of this paper. 4 discussion of related work for background on indexing and query execution in ir and search engines, we refer to [***, 5, 40]<1>, and for basics of parallel search engine architecture we refer to [7, 8, 26, 34]<1>. discussions and comparisons of local and global index partitioning schemes and their performance are given, e influence:2 type:3 pair index:803 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. 
a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:442 citee title:distributed query processing using partitioned inverted files citee abstract:in this paper, we study query processing in a distributed text databasefi the novelty is a real distributed architec-ture implementation that offers concurrent query servicefi the distributed system adopts a network of workstations model and the client-server paradigmfi the document col-lection is indexed with an inverted filefi we adopt two dis-tinct strategies of index partitioning in the distributed sys-tem, namely local index partitioning and global index parti-tioningfi in both strategies, documents are ranked using the vector space model along with a document filtering tech-nique for fast rankingfi we evaluate and compare the impact of the two index partitioning strategies on query processing performancefi experimental results on retrieval efficiency show that, within our framework, the global index parti-tioning outperforms the local index partitioningfi surrounding text:2 each scheme has advantages and disadvantages that we do not have space to discuss here. see [***, 28, 37]<2>. in this paper, we assume a local index organization, although some of the ideas also apply to a global index influence:2 type:2 pair index:804 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. 
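the local (document-partitioned) index organization discussed above is straightforward to sketch: every node evaluates the query against its own subset of documents and returns a local top-k, and a broker merges these partial results. the code below is an illustrative toy with an additive term-weight score; it is not the cited systems' implementation.

```python
import heapq

def search_partition(inverted_index, query_terms, k):
    """Score one partition's documents with a simple additive term score and
    return its local top-k as (score, doc_id) pairs."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in inverted_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def search_local_index(partitions, query_terms, k):
    """With a local index organization, every node holds a full inverted index
    for its own documents; the broker merges the per-node top-k results."""
    candidates = []
    for index in partitions:          # in a real engine: one request per node
        candidates.extend(search_partition(index, query_terms, k))
    return heapq.nlargest(k, candidates)

# tiny example with two partitions and hypothetical term weights
p1 = {"web": [("d1", 0.5), ("d2", 0.2)], "search": [("d1", 0.4)]}
p2 = {"web": [("d7", 0.9)], "search": [("d7", 0.3), ("d9", 0.8)]}
print(search_local_index([p1, p2], ["web", "search"], 2))  # [(1.2, 'd7'), (0.9, 'd1')]
```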
in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:245 citee title:modern information retrieval citee abstract:information retrieval (ir) has changed considerably in the last years with the expansion of the web (world wide web) and the advent of modern and inexpensive graphical user interfaces and mass storage devices. as a result, traditional ir textbooks have become quite out-of-date which has led to the introduction of new ir books recently. nevertheless, we believe that there is still great need of a book that approaches the field in a rigorous and complete way from a computer-science perspective (in opposition to a user-centered perspective). this book is an effort to partially fulfill this gap and should be useful for a first course on information retrieval as well as for a graduate course on the topic. these www pages are not a digital version of the book, nor the complete contents of it. here you will find the preface, table of contents, glossary and two chapters available for reading on-line. the printed version can be ordered directly from addison-wesley-longman. surrounding text:we intend to include these results in a longer journal version of this paper. 4 discussion of related work for background on indexing and query execution in ir and search engines, we refer to [3, ***, 40]<1>, and for basics of parallel search engine architecture we refer to [7, 8, 26, 34]<1>. discussions and comparisons of local and global index partitioning schemes and their performance are given, e influence:2 type:3 pair index:805 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques.
citee id:567 citee title:finding authorities and hubs from link structures on the world wide web citee abstract:recently, there have been a number of algorithms proposed for analyzing hypertext link structures so as to determine the best "authorities" and "hubs" for a given topic or query. while such analysis is usually combined with content analysis, there is a sense in which some algorithms are deemed to be "more balanced" and others "more focused". we undertake a comparative study of hypertext link analysis algorithms guided by some experimental queries. we propose some formal criteria for evaluating and comparing link analysis algorithms surrounding text:e.g., [***, 13, 22, 24, 25, 33]<2>, that perform link-based ranking either at query time or as a preprocessing step. integrating term-based and other factors: despite the large amount of work on link-based ranking, there is almost no published work on how to efficiently integrate the techniques into a large search engine. a large amount of recent work has focused on link-based ranking and analysis schemes. see [***, 22, 24, 25, 31, 33]<2> for a small sample. previous work on pruning techniques for top-k queries can be divided into two fairly disjoint sets of literature influence:2 type:2 pair index:806 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:699 citee title:lessons from giant scale services citee abstract:web portals and isps such as aol, microsoft network, and yahoo have grown more than tenfold in the past five years. despite their scale, growth rates, and rapid evolution of content and features, these sites and other giant-scale services like instant messaging and napster must be always available. many other major web sites such as ebay, cnn, and wal-mart, have similar availability requirements. in this article, i look at the basic model for such services, focusing on the key real-world challenges they face: high availability, evolution, and growth, and developing some principles for attacking these problems. surrounding text:we note that pruning schemes such as the ones we propose are particularly attractive during peak times when the query load is significantly larger than average, and they can adapt to the load in a continuous and online fashion.
it appears that several search engines are already using techniques for dealing with high loads by modifying query execution [***]<1>, though no details have been published. second, we believe our results are interesting in the context of the ongoing discussion about different approaches to link analysis, in particular the issues of preprocessing-based [22, 31, 33]<2> vs. we intend to include these results in a longer journal version of this paper. 4 discussion of related work for background on indexing and query execution in ir and search engines, we refer to [3, 5, 40]<1>, and for basics of parallel search engine architecture we refer to [***, 8, 26, 34]<1>. discussions and comparisons of local and global index partitioning schemes and their performance are given, e influence:3 type:2 pair index:807 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:492 citee title:the anatomy of a large-scale hypertextual web search engine citee abstract:in this paper, we present google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. the prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ to engineer a search engine is a challenging task. search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. they answer tens of millions of queries every day. despite the importance of large-scale search engines on the web, very little academic research has been done on them. furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. this paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results.
this paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. surrounding text:, techniques that use the hyperlink (or graph) structure of the web to identify interesting pages or relationships between pages. one important technique is the pagerank technique underlying the google search engine [***]<3>, which assigns a global importance measure to each page on the web that is proportional to the number and importance of other pages linking to it. a number of other approaches have also been proposed, see, e. while it has been argued that a query-dependent link analysis might give better results, global techniques such as pagerank that allow precomputation are very attractive for reasons of efficiency and simplicity. brin and page allude to the possible advantages in their paper on the architecture of the original google engine [***]<3>, which contains what is essentially a simple term-based pruning technique based on the idea of fancy hits. a natural way to build a combined ranking function is to add up a term-based score and a suitably normalized score derived, e influence:2 type:2 pair index:808 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques.
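the surrounding text above describes pagerank as a precomputed global importance measure and suggests adding a term-based score to a suitably normalized global score. a minimal sketch of both pieces, assuming a simple damped power iteration and a linear combination with an assumed weight alpha, is shown below; the constants and the normalization are illustrative, not the paper's.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict: page -> list of out-links."""
    pages = set(out_links) | {q for links in out_links.values() for q in links}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            links = out_links.get(page, [])
            if not links:                      # dangling page: spread uniformly
                for q in pages:
                    new_rank[q] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(links)
                for q in links:
                    new_rank[q] += share
        rank = new_rank
    return rank

def combined_score(term_score, page_rank, max_page_rank, alpha=0.5):
    """One simple combined ranking function: a weighted sum of the term-based
    score and the global score normalized into [0, 1]."""
    return alpha * term_score + (1.0 - alpha) * (page_rank / max_page_rank)

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(graph)
print(combined_score(0.7, pr["c"], max(pr.values())))
```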
citee id:776 citee title:on the resemblance and containment of documents citee abstract:given two documents a and b we define two mathematical notions: their resemblance r(a; b) and their containment c(a; b) that seem to capture well the informal notions of "roughly the same" and "roughly contained". the basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. furthermore, the resemblance can be evaluated using a fixed size sample for each document. this paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using rabin fingerprints. surrounding text:the procedure terminates whenever the probability of having the correct top-k results goes above some threshold, say    or    . we note that statistics could be kept in the form of a histogram or a zipf parameter for each inverted list that upper-bounds the value distribution, and that correlations could be estimated based on hashing techniques similar to those in [***]<1>. for the fancy list organization, we have only implemented a very crude version of this idea where we terminate the scan whenever the number of candidates in influence:3 type:1 pair index:809 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:248 citee title:evaluating top-k queries over web-accessible databases citee abstract:a query to a web search engine usually consists of a list of keywords, to which the search engine responds with the best or top k pages for the query. this top-k query model is prevalent over multimedia collections in general, but also over plain relational data for certain applications. for example, consider a relation with information on available restaurants, including their location, price range for one diner, and overall food rating. a user who queries such a relation might simply specify the user's location and target price range, and expect in return the best 10 restaurants in terms of some combination of proximity to the user, closeness of match to the target price range, and overall food rating.
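the resemblance/containment abstract above reduces document similarity to set intersection estimated by sampling. a common way to picture this is a min-hash sketch over word shingles, shown below; python's built-in hash with random salts stands in for the rabin fingerprints of the paper, and the shingle width and sample size are arbitrary choices made only for illustration.

```python
import random

def shingles(text, w=4):
    """The set of contiguous word w-grams ('shingles') of a document."""
    words = text.split()
    return {" ".join(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}

def min_hash_signature(shingle_set, num_hashes=64, seed=0):
    """For each of num_hashes salted hash functions, keep the minimum hash
    value over the shingle set: a fixed-size sample of the set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def estimated_resemblance(sig_a, sig_b):
    """Fraction of positions where the signatures agree, which estimates
    |A intersection B| / |A union B| for the underlying shingle sets."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

a = min_hash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = min_hash_signature(shingles("the quick brown fox leaps over the lazy dog"))
print(estimated_resemblance(a, b))
```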
processing such top-k queries efficiently is challenging for a number of reasons. one critical such reason is that, in many web applications, the relation attributes might not be available other than through external web-accessible form interfaces, which we will have to query repeatedly for a potentially large set of candidate objects. in this paper, we study how to process topk queries efficiently in this setting, where the attributes for which users specify target values might be handled by external, autonomous sources with a variety of access interfaces. we present several algorithms for processing such queries, and evaluate them thoroughly using both synthetic and real web-accessible data. surrounding text:this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient as it results in many random lookups on disk. 5 other schemes [***, 18, 20]<2> work without random lookups or in cases where only some of the lists are sorted. thus, there are a variety of previous pruning schemes, though many have objectives or assumptions that are different from ours influence:2 type:2 pair index:810 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:493 citee title:optimization of inverted vector searches citee abstract:a simple algorithm is presented for increasing the efficiency of information retrieval searches which are implemented using inverted files. this optimization algorithm employs knowledge about the methods used for weighting document and query terms in order to examine as few inverted lists as possible. an extension to the basic algorithm allows greatly increased performance optimization at a modest cost in retrieval effectiveness. experimental runs are made examining several different term weighting models and showing the optimization possible with each. surrounding text:in the ir community, researcher have studied pruning techniques for the fast evaluation of vector space queries since at least the 1980s. some early work is described in [***, 21, 38, 41]<2>. 
most relevant to our work are the techniques by persin, zobel, and sacks-davis [32]<2> and the follow-up work in [1, 2]<2>, which study effective early termination (pruning) schemes for the cosine measure based on the idea of sorting the postings in the inverted lists by their contribution to the score of the document influence:2 type:2 pair index:811 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:517 citee title:evaluating the performance of distributed architectures for information retrieval using a variety of workloads citee abstract:the information explosion across the internet and elswhere offers access to an increasing number of document collections. in order for users to effectively access these collections, information retrieval (ir) systems must provide coordinated, concurrent, and distributed access. in this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1gb to 128gb. we implement a fully functional distributed ir system based on a multithreaded version of the inquery simulation model. we measure performance as a function of system parameters such as client command rate, number of document collections, ter ms per query, query term frequency, number of answers returned, and command mixture. our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. however under some realistic workloads, we demonstrate system organizations for which response time gracefully degrades as the workload increases and performance scales with the number of processors. this scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate. surrounding text:g. , in [4, ***, 23, 28, 37]<1>. 
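the early-termination idea attributed above to persin, zobel, and sacks-davis — postings sorted by their score contribution, with thresholds that first stop creating new accumulators and then stop reading a list altogether — can be sketched as follows; the thresholds, the additive scoring, and all names are illustrative assumptions rather than the published parameters.

```python
import heapq

def pruned_topk(index, query_terms, k, insert_threshold=0.5, add_threshold=0.05):
    """Approximate top-k over contribution-sorted inverted lists.

    `index[term]` is a list of (doc_id, contribution) sorted by descending
    contribution. High contributions may create new accumulators, lower ones
    may only update existing accumulators, and once contributions drop below
    add_threshold the rest of the list is skipped entirely.
    """
    accumulators = {}
    for term in query_terms:
        for doc_id, contribution in index.get(term, []):
            if contribution < add_threshold:
                break                          # remaining postings are even smaller
            if doc_id in accumulators:
                accumulators[doc_id] += contribution
            elif contribution >= insert_threshold:
                accumulators[doc_id] = contribution
            # else: too small to start a new candidate; skip
    return heapq.nlargest(k, ((s, d) for d, s in accumulators.items()))

toy_index = {
    "ranking": [("d3", 1.2), ("d1", 0.9), ("d8", 0.4), ("d5", 0.03)],
    "queries": [("d1", 1.1), ("d3", 0.6), ("d5", 0.2)],
}
print(pruned_topk(toy_index, ["ranking", "queries"], 2))  # [(2.0, 'd1'), (1.8, 'd3')]
```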
a large amount of recent work has focused on link-based ranking and analysis schemes influence:2 type:3 pair index:812 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:269 citee title:automatic resource list compilation by analyzing hyperlink structure and associated text citee abstract:we describe the design, prototyping and evaluation of arc, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. the goal of arc is to compile resource lists similar to those provided by yahoo! or infoseek. the fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while arc operates fully automatically. we describe the evaluation of arc, yahoo!, and infoseek resource lists by a panel of human users. this evaluation suggests that the resources found by arc frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. we also provide examples of arc resource lists for the reader to examine. surrounding text:g. , [6, ***, 22, 24, 25, 33]<2>, that perform link-based ranking either at query time or as a preprocessing step. integrating term-based and other factors: despite the large amount of work on link-based ranking, there is almost no published work on how to efficiently integrate the techniques into a large search engine influence:2 type:3 pair index:813 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. 
over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:519 citee title:optimizing queries over multimedia repositories citee abstract:repositories of multimedia objects having multiple types of attributes (e.g., image, text) are becoming increasingly common. a selection on these attributes will typically produce not just a set of objects, as in the traditional relational query model (filtering), but also a grade of match associated with each object, indicating how well the object matches the selection condition (ranking). also, multimedia repositories may allow access to the attributes of each object only through indexes. we investigate how to optimize the processing of queries over multimedia repositories. a key issue is the choice of the indexes used to search the repository. we define an execution space that is search-minimal, i.e., the set of indexes searched is minimal. although the general problem of picking an optimal plan in the search-minimal execution space is np-hard, we solve the problem efficiently when the predicates in the query are independent. we also show that the problem of optimizing queries that ask for a few top-ranked objects can be viewed, in many cases, as that of evaluating selection conditions. thus, both problems can be viewed together as an extended filtering problem. surrounding text:g. , [1, 2, ***, 15, 18, 32]<2> for recent work and [38]<0> for an overview of older work. typically, these techniques reorder the inverted lists such that highscoring documents are likely to be at the beginning, and then terminate the search over the inverted lists once most of the high-scoring documents have been scored influence:2 type:2 pair index:814 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. 
in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:330 citee title:combining fuzzy information from multiple systems citee abstract:in a traditional database system, the result of a query is a set of values (those values that satisfy the query). in other data servers, such as a system with queries based on image content, or many text retrieval systems, the result of a query is a sorted list. for example, in the case of a system with queries based on image content, the query might ask for objects that are a particular shade of red, and the result of the query would be a sorted list of objects in the database, sorted by how ... surrounding text:g. , [1, 2, 14, ***, 18, 32]<2> for recent work and [38]<0> for an overview of older work. typically, these techniques reorder the inverted lists such that highscoring documents are likely to be at the beginning, and then terminate the search over the inverted lists once most of the high-scoring documents have been scored. more recently, there has been a lot of work on top-( queries in the database community. see [16]<0> for a survey and [***, 18]<2> for a formal analysis of some schemes. this work was originally motivated by queries to multimedia databases, e. stated in ir terms, the algorithms also assume that postings in the inverted lists are sorted by their contributions and are accessed in sorted order. however, several of the algorithms proposed in [***, 18, 19, 30, 39]<2> also assume that once a document is encountered in one of the inverted lists, we can efficiently compute its complete score by performing lookups into the other inverted lists. this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient as it results in many random lookups on disk influence:2 type:2 pair index:815 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. 
our results show that there is significant potential benefit in such techniques. citee id:331 citee title:combining fuzzy information: an overview citee abstract:assume that each object in a database has m grades, or scores, one for each of m attributesfi surrounding text:more recently, there has been a lot of work on top-( queries in the database community. see [***]<0> for a survey and [15, 18]<2> for a formal analysis of some schemes. this work was originally motivated by queries to multimedia databases, e influence:2 type:3 pair index:816 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:40 citee title:static index pruning for information retrieval systems citee abstract:we introduce static index pruning methods that significantly reduce the index size in information retrieval systems. we investigate uniform and term-based methods that each remove selected entries from the index and yet have only a minor effect on retrieval results. in uniform pruning, there is a fixed cutoff threshold, and all index entries whose contribution to relevance scores is bounded above by a given threshold are removed from the index. in term-based pruning, the cutoff threshold is determined for each term, and thus may vary from term to term. we give experimental evidence that for each level of compression, term-based pruning outperforms uniform pruning, under various measures of precision. we present theoretical and experimental evidence that under our term-based pruning scheme, it is possible to prune the index greatly and still get retrieval results that are almost as good as those based on the full index surrounding text:other work assumes that random lookups are feasible. schemes can be either precise or only compute approximate top-( results [14, 18]<3>, or use precomputation to reduce the lengths of the stored inverted lists [***]<3>. 
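the static index pruning paper excerpted above distinguishes uniform cutoffs from per-term cutoffs; a minimal sketch of the two flavors, with placeholder score contributions and threshold rules rather than the ones used in that work:

def uniform_prune(index, tau):
    # index: {term: [(doc_id, score_contribution), ...]}
    # keep only postings whose contribution exceeds one global cutoff tau
    return {t: [p for p in plist if p[1] > tau] for t, plist in index.items()}

def term_based_prune(index, k, eps):
    # per-term cutoff: keep postings scoring within a factor eps of the
    # k-th largest contribution in that term's list (placeholder rule)
    pruned = {}
    for t, plist in index.items():
        if len(plist) <= k:
            pruned[t] = list(plist)
            continue
        kth = sorted((s for _, s in plist), reverse=True)[k - 1]
        pruned[t] = [(d, s) for d, s in plist if s >= eps * kth]
    return pruned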
we also note that much of the previous work performs ranking on the union, rather than intersection, of the inverted lists influence:2 type:2 pair index:817 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:254 citee title:optimal aggregation algorithms for middleware citee abstract:assume that each object in a database has m grades, or scores, one for eachof m attributes. for example, an object can have a color grade, that tells how red it is, and a shape grade, that tells how round it is. for each attribute, there is a sorted list, which lists each object and its grade under that attribute, sorted by grade (highest grade first). each object is assigned an overall grade, that is obtained by combining the attribute grades using a fixed monotone aggregation function, or combining rule, suchas min or average. to determine the top k objects, that is, k objects with the highest overall grades, the naive algorithm must access every object in the database, to find its grade under each attribute. fagin has given an algorithm (fagins algorithm, or fa) that is much more efficient. for some monotone aggregation functions, fa is optimal with high probability in the worst case. we analyze an elegant and remarkably simple algorithm (the threshold algorithm, or ta) that is optimal in a much stronger sense than fa. we show that ta is essentially optimal, not just for some monotone aggregation functions, but for all of them, and not just in a high-probability worst-case sense, but over every database. unlike fa, which requires large buffers (whose size may grow unboundedly as the database size grows), ta requires only a small, constant-size buffer. ta allows early stopping, which yields, in a precise sense, an approximate version of the top k answers. we distinguish two types of access: sorted access (where the middleware system obtains the grade of an object in some sorted list by proceeding through the list sequentially from the top), and random access (where the middleware system requests the grade of object in a list, and obtains it in one step). 
we consider the scenarios where random access is either impossible, or expensive relative to sorted access, and provide algorithms that are essentially optimal for these cases as well. r 2003 elsevier science (usa). all rights reserved. surrounding text:g. , [1, 2, 14, 15, ***, 32]<2> for recent work and [38]<0> for an overview of older work. typically, these techniques reorder the inverted lists such that highscoring documents are likely to be at the beginning, and then terminate the search over the inverted lists once most of the high-scoring documents have been scored. more recently, there has been a lot of work on top-( queries in the database community. see [16]<0> for a survey and [15, ***]<2> for a formal analysis of some schemes. this work was originally motivated by queries to multimedia databases, e. stated in ir terms, the algorithms also assume that postings in the inverted lists are sorted by their contributions and are accessed in sorted order. however, several of the algorithms proposed in [15, ***, 19, 30, 39]<2> also assume that once a document is encountered in one of the inverted lists, we can efficiently compute its complete score by performing lookups into the other inverted lists. this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient as it results in many random lookups on disk. this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient as it results in many random lookups on disk. 5 other schemes [10, ***, 20]<2> work without random lookups or in cases where only some of the lists are sorted. thus, there are a variety of previous pruning schemes, though many have objectives or assumptions that are different from ours influence:1 type:2 pair index:818 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:805 citee title:optimizing multifeature queries in image databases citee abstract:in digital libraries image retrieval queries can be based on the similarity of objects, using several feature attributes like shape, texture, color or text. such multi-feature queries return a ranked result set instead of exact matches. 
besides, the user wants to see only the k top-ranked objects. we present a new algorithm called quick-combine (european patent pending, nr. ep 00102651.7) for combining multi-feature result lists, guaranteeing the correct retrieval of the k top-ranked results. for score aggregation virtually any combining function can be used, including weighted queries. compared to fagins algorithm we have developed an improved termination condition in tuned combination with a heuristic control flow adopting itself narrowly to the particular score distribution. top-ranked results can be computed and output incrementally. we show that we can dramatically improve performance, in particular for non-uniform score distributions. benchmarks on practical data indicate efficiency gains by a factor of 30. for very skewed data observed speed-up factors are even larger. these performance results scale through different database sizes and numbers of result sets to combine. surrounding text:stated in ir terms, the algorithms also assume that postings in the inverted lists are sorted by their contributions and are accessed in sorted order. however, several of the algorithms proposed in [15, 18, ***, 30, 39]<2> also assume that once a document is encountered in one of the inverted lists, we can efficiently compute its complete score by performing lookups into the other inverted lists. this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient as it results in many random lookups on disk influence:2 type:2 pair index:819 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. 
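the threshold-style algorithms discussed in the surrounding text combine sorted access down each list with random lookups to complete a document's score; a compact sketch under the assumption of a sum aggregation function and in-memory lists (illustrative only, not a faithful reimplementation of ta or quick-combine):

import heapq

def threshold_topk(sorted_lists, random_access, k):
    # sorted_lists: one list per query term of (grade, obj) pairs in
    # decreasing grade order; random_access(obj, j) -> grade of obj in list j
    iters = [iter(lst) for lst in sorted_lists]
    last = [None] * len(iters)     # last grade seen under sorted access
    best = []                      # min-heap of (total_score, obj)
    seen = set()
    while True:
        progressed = False
        for i, it in enumerate(iters):
            try:
                grade, obj = next(it)
            except StopIteration:
                continue
            progressed = True
            last[i] = grade
            if obj not in seen:
                seen.add(obj)
                total = sum(random_access(obj, j) for j in range(len(iters)))
                heapq.heappush(best, (total, obj))
                if len(best) > k:
                    heapq.heappop(best)
        threshold = sum(g for g in last if g is not None)
        # stop when k objects already score at least the threshold,
        # or when every list has been read to the end
        if (len(best) == k and best[0][0] >= threshold) or not progressed:
            return sorted(best, reverse=True)

the stopping threshold is the sum of the last grades seen under sorted access; the random lookups that complete each score are exactly the part that becomes expensive when they turn into disk seeks, as the surrounding text points out.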
citee id:669 citee title:towards efficient multi-feature queries in heterogeneous environments citee abstract:applications like multimedia databases or enterprisewideinformation management systems have to meet thechallenge of efficiently retrieving best matching objectsfrom vast collections of datafi we present a new algorithmstream-combine for processing multi-feature queries onheterogeneous data sourcesfi stream-combine is selfadaptingto different data distributions and to the specifickind of the combining functionfi furthermore we present anew retrieval strategy that will essentially speed up surrounding text:this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient as it results in many random lookups on disk. 5 other schemes [10, 18, ***]<2> work without random lookups or in cases where only some of the lists are sorted. thus, there are a variety of previous pruning schemes, though many have objectives or assumptions that are different from ours influence:2 type:2 pair index:820 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:498 citee title:retrieving records from a gigabyte of text on a minicomputer using statistical ranking citee abstract:statistically based ranked retrieval of records using keywords provides many advantages over traditional boolean retrieval methods, especially for end users. this approach to retrieval, however, has not seen widespread use in large operational retrieval systems. to show the feasibility of this retrieval methodology, research was done to produce very fast search techniques using these ranking algorithms, and then to test the results against large databases with many end users. the results show not only response times on the order of 1 and l/2 seconds for 806 megabytes of text, but also very favorable user reaction. novice users were able to consistently obtain good search results after 5 minutes of training. additional work was done to devise new indexing techniques to create inverted files for large databases using a minicomputer. 
these techniques use no sorting, require a working space of only about 20% of the size of the input text, and produce indices that are about 14% of the input text size. surrounding text:in the ir community, researchers have studied pruning techniques for the fast evaluation of vector space queries since at least the 1980s. some early work is described in [11, ***, 38, 41]<2>. most relevant to our work are the techniques by persin, zobel, and sacks-davis [32]<2> and the follow-up work in [1, 2]<2>, which study effective early termination (pruning) schemes for the cosine measure based on the idea of sorting the postings in the inverted lists by their contribution to the score of the document influence:2 type:3 pair index:821 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:806 citee title:topic-sensitive pagerank citee abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative "importance" of web pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these surrounding text:g. , [6, 13, ***, 24, 25, 33]<2>, that perform link-based ranking either at query time or as a preprocessing step. integrating term-based and other factors: despite the large amount of work on link-based ranking, there is almost no published work on how to efficiently integrate the techniques into a large search engine. it appears that several search engines are already using techniques for dealing with high loads by modifying query execution [7]<1>, though no details have been published. second, we believe our results are interesting in the context of the ongoing discussion about different approaches to link analysis, in particular the issues of preprocessing-based [***, 31, 33]<2> vs. online [24, 25]<2> techniques, and global [31]<2> vs. online [24, 25]<2> techniques, and global [31]<2> vs. topic-specific [***]<2> vs.
query-specific [24, 25, 33]<2> techniques. on the other hand, we also plan to evaluate our techniques using the trec web data set, in order to explore the trade-off between efficiency and the result quality in terms of precision and recall. other problems for future research are the impact of using distances between terms in the document, and topic-specific link analysis techniques such as [***, 33]<2>. in addition, there are several loose ends in our experimental evaluation that we plan to resolve first. a large amount of recent work has focused on link-based ranking and analysis schemes. see [6, ***, 24, 25, 31, 33]<2> for a small sample. previous work on pruning techniques for top-( queries can be divided into two fairly disjoint sets of literature influence:1 type:2 pair index:822 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:664 citee title:inverted file partitioning schemes in multiple disk systems citee abstract:multiple-disk i/o systems (disk arrays) have been an attractive approach to meet highperformance i/o demands in data intensive applications such as information retrieval systemsfiwhen we partition and distribute files across multiple disks to exploit the potential for i/oparallelism, a balanced i/o workload distribution becomes important for good performancefinaturally, the performance of a parallel information retrieval system using an inverted filestructure is affected by the partitioning surrounding text:g. , in [4, 12, ***, 28, 37]<1>. a large amount of recent work has focused on link-based ranking and analysis schemes influence:2 type:3 pair index:823 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. 
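a toy illustration of balancing inverted lists across multiple disks, in the spirit of the partitioning work excerpted above; the greedy longest-list-first assignment and the use of raw list length as the load measure are assumptions made for the sketch:

import heapq

def assign_lists_to_disks(list_lengths, num_disks):
    # list_lengths: {term: number_of_postings}; greedily place the longest
    # lists first on the currently least-loaded disk to balance i/o load
    disks = [(0, d, []) for d in range(num_disks)]   # (load, disk_id, terms)
    heapq.heapify(disks)
    for term, length in sorted(list_lengths.items(), key=lambda kv: -kv[1]):
        load, d, terms = heapq.heappop(disks)
        terms.append(term)
        heapq.heappush(disks, (load + length, d, terms))
    return {d: terms for _, d, terms in disks}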
to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:241 citee title:authoritative sources in a hyperlinked environment citee abstract:the network structure of a hyperlinked environment can be a rich source of in-formation about the content of the environment, provided we have effective means forunderstanding it. we develop a set of algorithmic tools for extracting information fromthe link structures of such environments, and report on experiments that demonstratetheir effectiveness in a variety of contexts on the world wide web. the central issuewe address within our framework is the distillation of broad search topics, throughthe discovery of authoritative information sources on such topics. we propose andtest an algorithmic formulation of the notion of authority, based on the relationshipbetween a set of relevant authoritative pages and the set of hub pages that join themtogether in the link structure. our formulation has connections to the eigenvectorsof certain matrices associated with the link graph; these connections in turn motivateadditional heuristics for link-based analysis. surrounding text:g. , [6, 13, 22, ***, 25, 33]<2>, that perform link-based ranking either at query time or as a preprocessing step. integrating term-based and other factors: despite the large amount of work on link-based ranking, there is almost no published work on how to efficiently integrate the techniques into a large search engine. second, we believe our results are interesting in the context of the ongoing discussion about different approaches to link analysis, in particular the issues of preprocessingbased [22, 31, 33]<2> vs. online [***, 25]<2> techniques, and global [31]<2> vs. topic-specific [22]<2> vs. topic-specific [22]<2> vs. query-specific [***, 25, 33]<2> techniques. our results indicate that a global precomputed ordering is highly desirable from a performance point of view, as it allows optimized index layout and pruning. a large amount of recent work has focused on link-based ranking and analysis schemes. see [6, 22, ***, 25, 31, 33]<2> for a small sample. previous work on pruning techniques for top-( queries can be divided into two fairly disjoint sets of literature influence:1 type:2 pair index:824 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. 
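the hub/authority computation described in the abstract above boils down to a power iteration on a (typically query-specific) subgraph; a minimal sketch with a plain adjacency-dict graph and the usual l2 normalization:

import math

def hits(graph, iterations=50):
    # graph: {page: [pages it links to]}
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages pointing to the page
        auth = {n: 0.0 for n in nodes}
        for u, outs in graph.items():
            for v in outs:
                auth[v] += hub[u]
        # hub score: sum of authority scores of the pages it points to
        hub = {u: sum(auth[v] for v in graph.get(u, ())) for u in nodes}
        # normalize both vectors
        for vec in (auth, hub):
            norm = math.sqrt(sum(x * x for x in vec.values())) or 1.0
            for n in vec:
                vec[n] /= norm
    return hub, auth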
a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:807 citee title:the stochastic approach for link-structure analysis (salsa) and the tkc effect citee abstract:today, when searching for information on the www, one usually performsa query through a term-based search enginefi these engines return, as thequery"s result, a list of web sites whose contents matches the queryfi forbroad topic queries, such searches often result in a huge set of retrieveddocuments, many of which are irrelevant to the userfi however, much informationis contained in the link-structure of the wwwfi information such aswhich pages are linked to others can be used to augment surrounding text:g. , [6, 13, 22, 24, ***, 33]<2>, that perform link-based ranking either at query time or as a preprocessing step. integrating term-based and other factors: despite the large amount of work on link-based ranking, there is almost no published work on how to efficiently integrate the techniques into a large search engine. second, we believe our results are interesting in the context of the ongoing discussion about different approaches to link analysis, in particular the issues of preprocessingbased [22, 31, 33]<2> vs. online [24, ***]<2> techniques, and global [31]<2> vs. topic-specific [22]<2> vs. topic-specific [22]<2> vs. query-specific [24, ***, 33]<2> techniques. our results indicate that a global precomputed ordering is highly desirable from a performance point of view, as it allows optimized index layout and pruning. a large amount of recent work has focused on link-based ranking and analysis schemes. see [6, 22, 24, ***, 31, 33]<2> for a small sample. previous work on pruning techniques for top-( queries can be divided into two fairly disjoint sets of literature influence:1 type:2 pair index:825 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. 
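for contrast with "without scanning over the full inverted lists", a baseline term-at-a-time evaluation that does read every posting and keeps one score accumulator per document; the tf-idf weighting is a placeholder for whatever term-based ranking function the engine actually uses:

import heapq, math
from collections import defaultdict

def exhaustive_topk(index, doc_count, query_terms, k):
    # index: {term: [(doc_id, term_frequency), ...]}; every posting is read
    acc = defaultdict(float)
    for t in query_terms:
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(doc_count / len(postings))
        for doc, tf in postings:
            acc[doc] += tf * idf      # accumulate this term's contribution
    return heapq.nlargest(k, acc.items(), key=lambda kv: kv[1])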
over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:808 citee title:optimizing result prefetching in web search engines with segmented indices citee abstract:we study the process in which search engineswith segmented indices serve queries. in par-ticular, we investigate the number of resultpages which search engines should prepareduring the query processing phase.search engine users have been observed tobrowse through very few pages of results forqueries which they submit. this behavior ofusers suggests that prefetching many resultsupon processing an initial query is not e-cient, since most of the prefetched results willnot be requested by the user who initiatedthesearch. however, a policy which abandons re-sult prefetching in favor of retrieving just therst page of search results mightnot makeop-timal use of system resources as well.we argue that for a certain behavior of users,engines should prefetch a constant numberof result pages per query. we dene a con-crete query processing model for search en-gines with segmented indices, and analyze thecost of such prefetching policies. based onthese costs, we show how to determine theconstant which optimizes the prefetching pol-icy. our results are mostly applicable tolocalindexpartitions of the inverted les, but arealso applicable to processing of short queriesinglobal indexarchitectures. surrounding text:another machine acts as a query integrator that receives queries from the users, broadcasts them, and then merges the returned results into a proper ranking that is sent back to the user. see [***]<1> for details. thus, one way to answer a top query is to ask for the top  results from each machine. we intend to include these results in a longer journal version of this paper. 4 discussion of related work for background on indexing and query execution in ir and search engines, we refer to [3, 5, 40]<1>, and for basics of parallel search engine architecture we refer to [7, 8, ***, 34]<1>. discussions and comparisons of local and global index partitioning schemes and their performance are given, e influence:2 type:3 pair index:826 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. 
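the query integrator described in the excerpt above broadcasts a query and merges the per-node results into one ranking; a small sketch of that merge step, assuming each node returns (score, doc id) pairs already sorted by decreasing score:

import heapq

def integrate_results(per_node_results, k):
    # per_node_results: list of lists of (score, doc_id), each sorted by
    # decreasing score on its node; merge them and keep the global top k
    merged = heapq.merge(*per_node_results, key=lambda p: -p[0])
    top, seen = [], set()
    for score, doc in merged:
        if doc in seen:      # a document lives on one node, but be safe
            continue
        seen.add(doc)
        top.append((score, doc))
        if len(top) == k:
            break
    return top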
over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:809 citee title:scalable distributed architectures for information retrieval citee abstract:as information explodes across the internet and intranets, information retrieval (ir) systems must cope with the challenge of scale. how to provide scalable performance for rapidly increasing data and workloads is critical in the design of next generation information retrieval systems. this dissertation studies scalable distributed ir architectures that not only provide quick response but also maintain acceptable retrieval accuracy. our distributed architectures exploit parallelism in information retrieval on a cluster of parallel ir servers using symmetric multiprocessors, and use partial collection replication and selection as well as collection selection to restrict the search to a small percentage of data while maintaining retrieval accuracy. we first investigate using partial collection replication for ir systems. we examine query locality in real systems, how to select a partial replica based on relevance, how to load-balance between replicas and the original collection, as well as updating overheads and strategies. our results show that there exists sufficient query locality to justify partial replication for information retrieval. our proposed replica selection algorithm effectively selects relevant partial replicas, and is inexpensive to implement. our evidence also indicates that partial replication achieves better performance than caching queries, because the replica selection algorithm finds similarity between non-identical queries, and thus increases observed locality. we use a validated simulator to perform a detailed performance evaluation of distributed ir architectures. we explore how best to build parallel ir servers using symmetric multiprocessors, evaluate the performance of partial collection replication and collection selection, and compare the performance of partial collection replication with collection partitioning as well as collection selection. surrounding text:however, to our knowledge several of the major engines still scan the entire inverted lists for most queries. with the exception of [***]<1>, we are not aware of any previous large-scale study on query throughput in large engines under web query loads. 5 pruning techniques we now describe the different pruning techniques that we study in this paper influence:2 type:2 pair index:827 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. 
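one plausible pruning scheme of the general kind discussed here, sketched under two assumptions that are not taken from the excerpts: the final score is a weighted sum of a term-based score and a precomputed global rank, and postings are laid out in decreasing order of global rank so that only a prefix of each list is read:

import heapq
from collections import defaultdict

def prefix_pruned_topk(index, global_rank, query_terms, k,
                       prefix_fraction=0.1, alpha=0.5):
    # index: {term: [(doc_id, term_score), ...]} with postings sorted by
    # decreasing global_rank[doc_id]; only the first prefix_fraction of each
    # list is scanned, so low-ranked documents are never touched
    acc = defaultdict(float)
    for t in query_terms:
        postings = index.get(t, [])
        prefix = postings[: max(1, int(len(postings) * prefix_fraction))]
        for doc, term_score in prefix:
            acc[doc] += term_score
    combined = {d: alpha * s + (1 - alpha) * global_rank.get(d, 0.0)
                for d, s in acc.items()}
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])

this is only meant to show where a global page ordering changes the index layout; the actual schemes and their efficiency are what the citer evaluates experimentally.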
a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:290 citee title:building a distributed full-text index for the web citee abstract:we identify crucial design issues in building a distributed inverted index for a large collection of web pages. we introduce a novel pipelining technique for structuring the core index-building system that substantially reduces the index construction time. we also propose a storage scheme for creating and managing inverted files using an embedded database system. we suggest and compare different strategies for collecting global statistics from distributed inverted indexes. finally, we present performance results from experiments on a testbed distributed web indexing system that we have implemented. surrounding text:2 each scheme has advantages and disadvantages that we do not have space to discuss here. see [4, ***, 37]<2>. in this paper, we assume a local index organization, although some of the ideas also apply to a global index influence:3 type:2 pair index:828 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. 
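a toy illustration of the local index organization assumed in the surrounding text: documents are partitioned across nodes and each node builds an inverted index over its own documents only (the hash-based assignment is just one possible choice):

from collections import defaultdict

def build_local_indexes(docs, num_nodes):
    # docs: {doc_id: iterable of terms}; each node gets a full inverted index
    # over the documents assigned to it (local / document-partitioned scheme)
    indexes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id, terms in docs.items():
        node = hash(doc_id) % num_nodes     # document-based partitioning
        for term in set(terms):
            indexes[node][term].append(doc_id)
    return indexes

# under a global (term-partitioned) organization, complete lists would instead
# be assigned to nodes, e.g. by hash(term) % num_nodes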
citee id:289 citee title:breadth-first search crawling yields high-quality pages citee abstract:this paper examines the average page quality over time of pages downloaded during a web crawl of 328 million unique pages. we use the connectivity-based metric pagerank to measure the quality of a page. we show that traversing the web graph in breadth-first search order is a good crawling strategy, as it tends to discover high-quality pages early on in the crawl. surrounding text:the crawl started at a hundred homepages of us universities, and was performed in a breadth-first manner. 7 as observed in [***]<3>, such a crawl will quickly find most pages with significant pagerank value. the total uncompressed size of the data was around  influence:3 type:3 pair index:829 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:527 citee title:query processing issues in image (multimedia) databases citee abstract:multimedia databases have attracted academic and industrial interest, and systems such as qbic (content based image retrieval system from ibm) have been released. such systems are essential to effectively and efficiently use the existing large collections of image data in the modern computing environment. the aim of such systems is to enable retrieval of images based on their contents. this problem has brought together the (decades old) database and image processing communities. as part of our surrounding text:stated in ir terms, the algorithms also assume that postings in the inverted lists are sorted by their contributions and are accessed in sorted order. however, several of the algorithms proposed in [15, 18, 19, ***, 39]<2> also assume that once a document is encountered in one of the inverted lists, we can efficiently compute its complete score by performing lookups into the other inverted lists.
this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient as it results in many random lookups on disk influence:2 type:2 pair index:830 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:748 citee title:the pagerank citation ranking: bringing order to the web citee abstract:the importance of a webpage is an inherently subjective matter, which depends on the readers interests, knowledge and attitudes. but there is still much that can be said objectively about the relative importance of web pages. this paper describes pagerank, a method for rating web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. we compare pagerank to an idealized random websurfer. we show how to efficiently compute pagerank for large numbers of pages. and,we show how to apply pagerank to search and to user navigation. surrounding text:it appears that several search engines are already using techniques for dealing with high loads by modifying query execution [7]<1>, though no details have been published. second, we believe our results are interesting in the context of the ongoing discussion about different approaches to link analysis, in particular the issues of preprocessingbased [22, ***, 33]<2> vs. online [24, 25]<2> techniques, and global [***]<2> vs. second, we believe our results are interesting in the context of the ongoing discussion about different approaches to link analysis, in particular the issues of preprocessingbased [22, ***, 33]<2> vs. online [24, 25]<2> techniques, and global [***]<2> vs. topic-specific [22]<2> vs. a large amount of recent work has focused on link-based ranking and analysis schemes. see [6, 22, 24, 25, ***, 33]<2> for a small sample. previous work on pruning techniques for top-( queries can be divided into two fairly disjoint sets of literature influence:2 type:2 pair index:831 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. 
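a standard power-iteration sketch of the pagerank computation referenced above, with uniform teleportation; replacing the uniform teleport term with a topic-biased vector would give the topic-sensitive variant mentioned in an earlier excerpt:

def pagerank(graph, damping=0.85, iterations=50):
    # graph: {page: [pages it links to]}; returns {page: rank}
    nodes = set(graph) | {v for outs in graph.values() for v in outs}
    if not nodes:
        return {}
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in nodes}   # uniform teleportation
        for u in nodes:
            outs = graph.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:
                # dangling page: spread its rank mass uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank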
a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:44 citee title:filtered document retrieval with frequency-sorted indexes citee abstract:ranking techniques are e ective at finding answers in document collections but can be expensive to evaluate. we propose an evaluation technique that uses early recognition of which documents are likely to be highly ranked to reduce costs; for our test data, queries are evaluated in 2% of the memory of the standard implementation without degradation in retrieval e ectiveness. cpu time and disk traffic can also be dramatically reduced by designing inverted indexes explicitly to support the technique. the principle of the index design is that inverted lists are sorted by decreasing within-document frequency rather than by document number, and this method experimentally reduces cpu time and disk traffic to around one third of the original requirement. we also show that frequency sorting can lead to a net reduction in index size, regardless of whether the index is compressed surrounding text:g. , [1, 2, 14, 15, 18, ***]<2> for recent work and [38]<0> for an overview of older work. typically, these techniques reorder the inverted lists such that highscoring documents are likely to be at the beginning, and then terminate the search over the inverted lists once most of the high-scoring documents have been scored. some early work is described in [11, 21, 38, 41]<2>. most relevant to our work are the techniques by persin, zobel, and sacks-davis [***]<2> and the follow-up work in [1, 2]<2>, which study effective early termination (pruning) schemes for the cosine measure based on the idea of sorting the postings in the inverted lists by their contribution to the score of the document. a scheme in [***]<2> also proposed to partition the inverted list into several partitions, somewhat reminiscent of our scheme in subsection 5. most relevant to our work are the techniques by persin, zobel, and sacks-davis [***]<2> and the follow-up work in [1, 2]<2>, which study effective early termination (pruning) schemes for the cosine measure based on the idea of sorting the postings in the inverted lists by their contribution to the score of the document. a scheme in [***]<2> also proposed to partition the inverted list into several partitions, somewhat reminiscent of our scheme in subsection 5. 
2, but for the purpose of achieving good compression of the index, rather than to integrate a global page ordering. g. , between and fi  fi in [***]<2>), or concerned with main memory consumption or cpu resources for evaluating the cosine measure. other work assumes that random lookups are feasible influence:1 type:2 pair index:832 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:810 citee title:the intelligent surfer: probabilistic combination of link and content information in pagerank citee abstract:the pagerank algorithm, used in the google search engine, greatly improves the results of web search by taking into account the link structure of the webfi pagerank assigns to a page a score propor-tional to the number of times a random surfer would visit that page, if it surfed indefinitely from page to page, following all outlinks from a page with equal probabilityfi we propose to improve page-rank by using a more intelligent surfer, one that is guided by a probabilistic model of the relevance of a page to a queryfi efficient execution of our algorithm at query time is made possible by pre-computing at crawl time (and thus once for all queries) the neces-sary termsfi experiments on two large subsets of the web indicate that our algorithm significantly outperforms pagerank in the (hu-man-rated) quality of the pages returned, while remaining efficient enough to be used in today's large search enginesfi surrounding text:g. , [6, 13, 22, 24, 25, ***]<2>, that perform link-based ranking either at query time or as a preprocessing step. integrating term-based and other factors: despite the large amount of work on link-based ranking, there is almost no published work on how to efficiently integrate the techniques into a large search engine. it appears that several search engines are already using techniques for dealing with high loads by modifying query execution [7]<1>, though no details have been published. second, we believe our results are interesting in the context of the ongoing discussion about different approaches to link analysis, in particular the issues of preprocessingbased [22, 31, ***]<2> vs. online [24, 25]<2> techniques, and global [31]<2> vs. 
topic-specific [22]<2> vs. query-specific [24, 25, ***]<2> techniques. our results indicate that a global precomputed ordering is highly desirable from a performance point of view, as it allows optimized index layout and pruning. on the other hand, we also plan to evaluate our techniques using the trec web data set, in order to explore the trade-off between efficiency and the result quality in terms of precision and recall. other problems for future research are the impact of using distances between terms in the document, and topic-specific link analysis techniques such as [22, ***]<2>. in addition, there are several loose ends in our experimental evaluation that we plan to resolve first. a large amount of recent work has focused on link-based ranking and analysis schemes. see [6, 22, 24, 25, 31, ***]<2> for a small sample. previous work on pruning techniques for top-( queries can be divided into two fairly disjoint sets of literature influence:2 type:2 pair index:833 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:811 citee title:search engines and web dynamics citee abstract:in this paper we study several dimensions of web dynamics in the context of large-scale internetsearch enginesfi both growth and update dynamics clearly represent big challenges for searchenginesfi we show how the problems arise in all components of a reference search engine modelfi surrounding text:we intend to include these results in a longer journal version of this paper. 4 discussion of related work for background on indexing and query execution in ir and search engines, we refer to [3, 5, 40]<1>, and for basics of parallel search engine architecture we refer to [7, 8, 26, ***]<1>. discussions and comparisons of local and global index partitioning schemes and their performance are given, e influence:2 type:3 pair index:834 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. 
a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental clusterbased search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:396 citee title:design and implementation of a high-performance distributed web crawler citee abstract:broad web search engines as well as many more specialized search tools rely on webcrawlers to acquire large collections of pages for indexing and analysisfi such a web crawlermay interact with millions of hosts over a period of weeks or months, and thus issues of robustness,flexibility, and manageability are of major importancefi in addition, i/o performance,network resources, and os limits must be taken into account in order to achieve high performanceat a reasonable costfiin this surrounding text:software and data sets: our experiments were run on a search engine prototype, named pingo, that is currently being developed in our research group. the document collection consisted of about    million web pages crawled by the polybot web crawler [***]<3> in october of 2002. not all of the pages are distinct and the set contains a significant number of duplicates due to pages being repeatedly downloaded because of crawl interruptions influence:3 type:3 pair index:835 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study ʿҲʶpruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. 
we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:771 citee title:odissea: a peer-to-peer architecture for scalable web search and information retrieval citee abstract:we consider the problem of building a p2p-based search engine for massive document collections. we describe a prototype system called odissea (open distributed search engine architecture) that is currently under development in our group. odissea provides a highly distributed global indexing and query execution service that can be used for content residing inside or outside of a p2p network. odissea is different from many other approaches to p2p search in that it assumes a two-tier search engine architecture and a global index structure distributed over the network. we give an overview of the proposed system and discuss the basic design choices. our main focus is on efficient query execution, and we discuss how recent work on top-k queries in the database community can be applied in a highly distributed environment. we also give preliminary simulation results on a real search engine log and a terabyte web collection that indicate good scalability for our approach. surrounding text:however, any other global ordering would also work. for simplicity, we describe all techniques for the case of  query terms. (footnote 5: the situation may be different in highly distributed environments with limited bandwidth [***]<2>.) influence:3 type:3 pair index:836 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times.
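to make the idea of folding a global page ordering into pruned query execution more concrete, the sketch below assumes that document ids are assigned in order of decreasing global score and that postings are stored in docid order, so the global component of the score can only shrink as the scan proceeds. the combined score, the weights, and the stopping bound are assumptions chosen for illustration; they are not the specific schemes evaluated in the citer paper.

# sketch: docids are assigned by decreasing global score, so global_score(doc)
# is non-increasing as we scan postings in docid order. we can stop once even
# a perfect term score cannot lift an unseen document above the current top-k.
# term scores are assumed to lie in [0, max_term_score].
import heapq

def pruned_and_query(list_a, list_b, global_score, k=10,
                     w_term=0.5, w_global=0.5, max_term_score=1.0):
    """list_a, list_b: [(doc_id, term_score), ...] in increasing doc_id order,
    where a smaller doc_id means a larger global score."""
    top = []  # min-heap of (combined_score, doc_id)
    i = j = 0
    while i < len(list_a) and j < len(list_b):
        da, db = list_a[i][0], list_b[j][0]
        next_doc = max(da, db)
        # upper bound on any matching document we have not reached yet
        bound = w_term * 2 * max_term_score + w_global * global_score(next_doc)
        if len(top) == k and bound <= top[0][0]:
            break  # no unseen document can enter the top-k
        if da == db:
            score = (w_term * (list_a[i][1] + list_b[j][1])
                     + w_global * global_score(da))
            heapq.heappush(top, (score, da))
            if len(top) > k:
                heapq.heappop(top)
            i += 1
            j += 1
        elif da < db:
            i += 1
        else:
            j += 1
    return sorted(top, reverse=True)

with a monotone global_score such as lambda doc_id: 1.0 / (1.0 + doc_id), the loop stops as soon as no unseen document can displace the current top-k results.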
citee id:812 citee title:performance of inverted indices in distributed text document retrieval systems citee abstract:the performance of distributed text document retrieval systems is strongly influenced by the organization of the inverted index. this paper compares the performance impact on query processing of various physical organizations for inverted lists. we present a new probabilistic model of the database and queries. simulation experiments determine which variables most strongly influence response time and throughput. this leads to a set of design trade-offs over a wide range of hardware surrounding text:2 each scheme has advantages and disadvantages that we do not have space to discuss here. see [4, 28, ***]<2>. in this paper, we assume a local index organization, although some of the ideas also apply to a global index. our experiments show the limits and benefits of the proposed techniques in terms of query throughput and response time for varying levels of concurrency. 2 in addition, there are hybrid schemes [***]<2>, and replication is often used in conjunction with partitioning for better performance. 3 influence:2 type:3,2 pair index:837 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:723 citee title:using fagin's algorithm for merging ranked results in multimedia middleware citee abstract:a distributed multimedia information system allows applications to access a variety of data, of different modalities, stored in data sources with their own specialized search capabilities. in such a system, the user can request that a set of objects be ranked by a particular property, or by a combination of properties. in , fagin gives an algorithm for efficiently merging multiple ordered streams of ranked results, to form a new stream ordered by a combination of those ranks surrounding text:stated in ir terms, the algorithms also assume that postings in the inverted lists are sorted by their contributions and are accessed in sorted order.
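the fagin-style algorithms described in this record, together with the random lookups cautioned against in the next one, can be sketched as a threshold algorithm: sorted access down each ranked list, plus random access to complete a document's score. the list layout, the equal weighting of lists, and the in-memory dictionary standing in for disk lookups are simplifying assumptions.

# threshold-algorithm sketch: sorted access over each list plus random
# lookups to complete a document's score. the dict-based "random access"
# stands in for the disk lookups the surrounding text cautions against.
import heapq

def threshold_algorithm(lists, k=10):
    """lists: one [(score, doc_id), ...] per term, each sorted by
    descending score. returns the top-k (total_score, doc_id) pairs."""
    random_access = [dict((d, s) for s, d in lst) for lst in lists]
    seen, top = set(), []  # top is a min-heap of (total_score, doc_id)
    for depth in range(max(len(lst) for lst in lists)):
        threshold = 0.0
        for lst, lookup in zip(lists, random_access):
            if depth >= len(lst):
                continue
            score, doc = lst[depth]
            threshold += score
            if doc not in seen:
                seen.add(doc)
                total = sum(other.get(doc, 0.0) for other in random_access)
                heapq.heappush(top, (total, doc))
                if len(top) > k:
                    heapq.heappop(top)
        if len(top) == k and top[0][0] >= threshold:
            break  # no unseen document can beat the current top-k
    return sorted(top, reverse=True)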
however, several of the algorithms proposed in [15, 18, 19, 30, ***]<2> also assume that once a document is encountered in one of the inverted lists, we can efficiently compute its complete score by performing lookups into the other inverted lists. this gives much better pruning than sorted access alone, but in a search engine context it may not be efficient, as it results in many random lookups on disk influence:3 type:2 pair index:838 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:343 citee title:managing gigabytes: compressing and indexing documents and images citee abstract:in this fully updated second edition of the highly acclaimed managing gigabytes, authors witten, moffat, and bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. whatever your field, if you work with large quantities of information, this book is essential reading--an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. it covers the latest developments in compression and indexing and their application on the web and in digital libraries. it also details dozens of powerful techniques supported by mg, the authors' own system for compressing, storing, and retrieving text, images, and textual images. mg's source code is freely available on the web. surrounding text:, new york) can be answered by looking at the positions of the two words. we refer to [***]<1> for more details. term-based ranking: the most common way to perform ranking in ir systems is based on comparing the words (terms) contained in the document and in the query. many different functions have been proposed, and the techniques in this paper are not limited to any particular ranking function, as long as certain conditions are met as described later. in our experiments, we use a version of the cosine measure [***]<1>. we intend to include these results in a longer journal version of this paper.
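the cosine-measure definition referenced near the end of the record above did not survive extraction. one standard weighted form, stated here only as an assumption about the family of functions intended (the cited managing gigabytes text uses a similar tf-idf style weighting, though the exact variant may differ), is

\mathrm{cos}(d, q) = \frac{\sum_{t \in q \cap d} w_{d,t}\, w_{q,t}}{\sqrt{\sum_{t} w_{d,t}^{2}}\;\sqrt{\sum_{t} w_{q,t}^{2}}},
\qquad w_{d,t} = 1 + \ln f_{d,t}, \quad w_{q,t} = \ln\!\left(1 + \frac{N}{f_t}\right),

where f_{d,t} is the frequency of term t in document d, f_t is the number of documents containing t, and N is the number of documents in the collection.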
4 discussion of related work. for background on indexing and query execution in ir and search engines, we refer to [3, 5, ***]<1>, and for basics of parallel search engine architecture we refer to [7, 8, 26, 34]<1>. discussions and comparisons of local and global index partitioning schemes and their performance are given, e influence:2 type:1 pair index:839 citer id:675 citer title:optimized query execution in large search engines with global page ordering citer abstract:large web search engines have to answer thousands of queries per second with interactive response times. a major factor in the cost of executing a query is given by the lengths of the inverted lists for the query terms, which increase with the size of the document collection and are often in the range of many megabytes. to address this issue, ir and database researchers have proposed pruning techniques that compute or approximate term-based ranking functions without scanning over the full inverted lists. over the last few years, search engines have incorporated new types of ranking techniques that exploit aspects such as the hyperlink structure of the web or the popularity of a page to obtain improved results. we focus on the question of how such techniques can be efficiently integrated into query processing. in particular, we study pruning techniques for query execution in large engines in the case where we have a global ranking of pages, as provided by pagerank or any other method, in addition to the standard term-based approach. we describe pruning schemes for this case and evaluate their efficiency on an experimental cluster-based search engine with  million web pages. our results show that there is significant potential benefit in such techniques. citee id:641 citee title:implementations of partial document ranking using inverted files citee abstract:most commercial text retrieval systems employ inverted files to improve retrieval speed. this paper concerns the implementations of document ranking based on inverted files. three heuristic methods for implementing the tf×idf weighting strategy, where tf stands for term frequency and idf stands for inverse document frequency, are studied. the basic idea of the heuristic methods is to process the query terms in an order so that as many top documents as possible can be identified surrounding text:in the ir community, researchers have studied pruning techniques for the fast evaluation of vector space queries since at least the 1980s. some early work is described in [11, 21, 38, ***]<2>. most relevant to our work are the techniques by persin, zobel, and sacks-davis [32]<2> and the follow-up work in [1, 2]<2>, which study effective early termination (pruning) schemes for the cosine measure based on the idea of sorting the postings in the inverted lists by their contribution to the score of the document influence:3 type:2 pair index:840 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated data: data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type.
we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:193 citee title:a variational bayesian framework for graphical models citee abstract:this paper presents a novel practical framework for bayesian model averaging and model selection in probabilistic graphical models. our approach approximates full posterior distributions over model parameters and structures, as well as latent variables, in an analytical manner. these posteriors fall out of a free-form optimization procedure, which naturally incorporates conjugate priors. unlike in large sample approximations, the posteriors are generally non-gaussian and no hessian needs surrounding text:in place of a point estimate, we now have a smooth posterior p(\beta | D). a variational approach can again be used to find an approximation to this posterior distribution [***]<1>. introducing variational dirichlet parameters \lambda_i for each of the k multinomials, we find that the only change to our earlier algorithm is to replace the maximization with respect to \beta with the following variational update: \lambda_{ij} = \eta + \sum_{d=1}^{D} \sum_{m=1}^{M_d} 1(w_{dm} = j) \sum_{n=1}^{N_d} \phi_{dni} \lambda_{dmn}
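a small numpy rendering of the smoothed update above may make the bookkeeping concrete; the array names (phi for per-region topic posteriors, lam_words for per-word region posteriors, eta for the dirichlet hyperparameter) are assumptions introduced for this sketch rather than the paper's notation.

# sketch of the smoothed update for the variational dirichlet parameters of
# the K topic-word multinomials: lam[i, j] = eta + expected count of word j
# being generated by topic i across the corpus. array names are assumptions.
import numpy as np

def update_topic_word_dirichlet(docs, K, V, eta=0.01):
    """docs: list of (words, phi, lam_words) where
         words:     array of word ids, shape (M,)
         phi:       region-topic posteriors, shape (N, K)
         lam_words: word-region posteriors, shape (M, N)."""
    lam = np.full((K, V), eta)
    for words, phi, lam_words in docs:
        # expected topic assignment of each word: (M, N) @ (N, K) -> (M, K)
        word_topic = lam_words @ phi
        for m, j in enumerate(words):
            lam[:, j] += word_topic[m]
    return lam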
, when k = 200, the perplexity is 2922). note that in related work [***]<2>, many of the models considered are variants of gm-mixture and rely heavily on an ad-hoc smoothing procedure to correct for overfitting. figure 5 (right) illustrates the caption perplexity under the smoothed estimates of each model using the empirical bayes procedure from section 4 influence:1 type:2 pair index:842 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated datadata with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is e ective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:3. 2 gaussianmultinomial lda the latent dirichlet allocation (lda) model is a latent variable model that allows factors to be allocated repeatedly within a given document or image [***]<1>. thus, di erent words in a document or di erent regions in an image can come from di erent underlying factors, and the document or image as a whole can be viewed as containing multiple \topics. rn) needed for region labeling. lda provides significant improvements in predictive performance over simpler mixture models in the domain of text data [***]<1>, and we expect for gm-lda to provide similar advantages over gm-mixture. indeed, we will see in section 5 that gm-lda does model the image/caption data better than gm-mixture influence:1 type:1 pair index:843 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated datadata with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is e ective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. 
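the repeated-allocation property described in the surrounding text above, where each word in a document may be drawn from a different underlying topic, is easy to see in a toy generative sketch; the sizes and symmetric hyperparameters below are arbitrary assumptions.

# toy lda-style generative process: one topic-proportion vector per document,
# a fresh topic draw per word. hyperparameters and sizes are arbitrary.
import numpy as np

def generate_corpus(n_docs=5, doc_len=20, K=3, V=50, alpha=0.5, beta=0.1,
                    seed=0):
    rng = np.random.default_rng(seed)
    topics = rng.dirichlet([beta] * V, size=K)      # K topic-word distributions
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * K)          # per-document topic mixture
        z = rng.choice(K, size=doc_len, p=theta)    # a topic per word
        words = [rng.choice(V, p=topics[zi]) for zi in z]
        corpus.append(words)
    return corpus

print(generate_corpus()[0])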
we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:752 citee title:the missing link - a probabilistic model of document content and hypertext connectivity citee abstract:we describe a joint probabilistic model for modeling the contents and inter-connectivity of document collections such as sets of web pages or research paper archives. the model is based on a probabilistic factor decomposition and allows identifying principal topics of the collection as well as authoritative documents within those topics. furthermore, the relationships between topics are mapped out in order to build a predictive model of link content. among the many applications of this approach are information retrieval and search, topic identification, query disambiguation, focused web crawling, web authoring, and bibliometric analysis. surrounding text:in addition to the traditional goals of retrieval, clustering, and classification, annotated data lends itself to tasks such as automatic data annotation and retrieval of unannotated data from annotation-type queries. a number of recent papers have considered generative probabilistic models for such multi-type or relational data [2, 6, ***, 13]<2>. these papers have generally focused on models that jointly cluster the different data types, basing the clustering on latent variable representations that capture low-dimensional probabilistic relationships among interacting sets of variables influence:3 type:2 pair index:844 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated data: data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:640 citee title:image information retrieval: an overview of current research citee abstract:this paper provides an overview of current research in image information retrieval and provides an outline of areas for future research. the approach is broad and interdisciplinary and focuses on three aspects of image research (ir): text-based retrieval, content-based retrieval, and user interactions with image information retrieval systems. the review concludes with a call for image retrieval evaluation studies similar to trec. keywords: information science, image retrieval, cbir, surrounding text:5.3 text-based image retrieval there has been a significant amount of computer science research on content-based image retrieval in which a particular query image (possibly a sketch or primitive graphic) is used to find matching relevant images [***]<1>.
in another line of research, multimedia information retrieval, representations of different data types (such as text and images) are used to retrieve documents that contain both [8]<2> influence:2 type:3 pair index:845 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated data: data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:267 citee title:automatic image annotation and retrieval using cross-media relevance models citee abstract:libraries have traditionally used manual image annotation for indexing and then later retrieving their image collections. however, manual image annotation is an expensive and labor-intensive procedure and hence there has been great interest in coming up with automatic ways to retrieve images based on content. here, we propose an automatic approach to annotating and retrieving images based on a training set of images. we assume that regions in an image can be described using a small vocabulary surrounding text:in addition to the traditional goals of retrieval, clustering, and classification, annotated data lends itself to tasks such as automatic data annotation and retrieval of unannotated data from annotation-type queries. a number of recent papers have considered generative probabilistic models for such multi-type or relational data [2, ***, 4, 13]<2>. these papers have generally focused on models that jointly cluster the different data types, basing the clustering on latent variable representations that capture low-dimensional probabilistic relationships among interacting sets of variables influence:1 type:2 pair index:846 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated data: data with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is effective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type.
we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:662 citee title:introduction to variational methods for graphical models citee abstract:this paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (bayesian networks and markov random fields).we present a number of examples of graphical models, including the qmr-dt database, the sigmoid belief network, the boltzmann machine, and several variants of hidden markov models, in which it is infeasible to run exact inference algorithms.we then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. inference in the simpified model provides bounds on probabilities of interest in the original model.we describe a general framework for generating variational transformations based on convex duality. finally we return to the examples and demonstrate how variational algorithms can be formulated in each case surrounding text:1 variational inference exact probabilistic inference for corr-lda is intractable. as before, we avail ourselves of variational inference methods [***]<1> to approximate the posterior distribution over the latent variables given a particular image/caption. in particular, we define the following factorized distribution on the latent variables: q( influence:2 type:1 pair index:847 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated datadata with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is e ective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. 
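the factorized variational distribution in the record above is cut off after "q(". for a corr-lda-style model with per-image topic proportions theta, region topic indicators z_1..z_N, and word-to-region indicators y_1..y_M, a mean-field family would typically be written as below; this is an assumption about the intended factorization, with gamma, phi_n, and lambda_m as the free variational parameters, not a reconstruction of the paper's exact equation.

q(\theta, \mathbf{z}, \mathbf{y} \mid \gamma, \phi, \lambda)
  = q(\theta \mid \gamma)\,
    \prod_{n=1}^{N} q(z_n \mid \phi_n)\,
    \prod_{m=1}^{M} q(y_m \mid \lambda_m)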
we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:141 citee title:a model of multimedia information retrieval citee abstract:research on multimedia information retrieval (mir) has recently witnessed a booming interest. a prominent feature of this research trend is its simultaneous but independent materialization within several fields of computer science. the resulting richness of paradigms, methods and systems may, in the long run, result in a fragmentation of efforts and slow down progress. the primary goal of this study is to promote an integration of methods and techniques for mir by contributing a conceptual model that encompasses in a unified and coherent perspective the many efforts that are being produced under the label of mir. the model offers a retrieval capability that spans two media, text and images, but also several dimensions: form, content and structure. in this way, it reconciles similarity-based methods with semantics-based ones, providing the guidelines for the design of systems that are able to provide a generalized multimedia retrieval service, in which the existing forms of retrieval not only coexist, but can be combined in any desired manner. the model is formulated in terms of a fuzzy description logic, which plays a twofold role: (1) it directly models semantics-based retrieval, and (2) it offers an ideal framework for the integration of the multimedia and multidimensional aspects of retrieval mentioned above. the model also accounts for relevance feedback in both text and image retrieval, integrating known techniques for taking into account user judgments. the implementation of the model is addressed by presenting a decomposition technique that reduces query evaluation to the processing of simpler requests, each of which can be solved by means of widely known methods for text and image retrieval, and semantic processing. a prototype for multidimensional image retrieval is presented that shows this decomposition technique at work in a significant case. surrounding text:3 text-based image retrieval there has been a significant amount of computer science research on content-based image retrieval in which a particular query image (possibly a sketch or primitive graphic) is used to find matching relevant images [5]<1>. in another line of research, multimedia information retrieval, representations of different data types (such as text and images) are used to retrieve documents that contain both [***]<2>. 3 we cannot quantitatively evaluate this task (i
we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:152 citee title:a probabilistic framework for semantic video indexing, filtering and retrieval citee abstract:semantic filtering and retrieval of multimedia content is crucial for efficient use of the multimedia data repositories. video query by semantic keywords is one of the most difficult problems in multimedia data retrieval. the difficulty lies in the mapping between low-level video representation and high-level semantics. we therefore formulate the multimedia content access problem as a multimedia pattern recognition problem. we propose a probabilistic framework for semantic video indexing, which can support filtering and retrieval and facilitate efficient content-based access. to map low-level features to high-level semantics we propose probabilistic multimedia objects (multijects). examples of multijects in movies include explosion, mountain, beach, outdoor, music etc. semantic concepts in videos interact and to model this interaction explicitly, we propose a network of multijects (multinet). using probabilistic models for six site multijects, rocks, sky, snow, water-body, forestry/greenery and outdoor and using a bayesian belief network as the multinet we demonstrate the application of this framework to semantic indexing.we demonstrate how detection performance can be significantly improved using the multinet to take interconceptual relationships into account. we also show how the multinet can fuse heterogeneous features to support detection based on inference and reasoning. surrounding text:less attention, however, has been focused on text-based image retrieval, an arguably more difficult task where a user submits a text query to find matching images for which there is no related text. previous approaches have essentially treated this task as a classification problem, handling specific queries from a vocabulary of about five words [***]<2>. in contrast, by using the conditional distribution of words given an image, our approach can handle arbitrary queries from a large vocabulary influence:1 type:2 pair index:849 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated datadata with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is e ective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:125 citee title:a language modeling approach to information retrieval citee abstract:models of document indexing and document retrieval have been extensively studied. the integration of these two classes of models has been the goal of several researchers but it is a very difficult problem. we argue that much of the reason for this is the lack of an adequate indexing model. this suggests that perhaps a better indexing model would help solve the problem. 
however, we feel that making unwarranted parametric assumptions will not lead to better retrieval performance. furthermore, making prior assumptions about the similarity of documents is not warranted either. instead, we propose an approach to retrieval based on probabilistic language modeling. we estimate models for each document individually. our approach to modeling is non-parametric and integrates document indexing and document retrieval into a single model. one advantage of our approach is that collection statistics which are used heuristically in many other retrieval models are an integral part of our model. we have implemented our model and tested it empirically. our approach significantly outperforms standard tf.idf weighting on two different collections and query sets. surrounding text:in contrast, by using the conditional distribution of words given an image, our approach can handle arbitrary queries from a large vocabulary. we use a unique form of the language modeling approach to information retrieval [***]<1> where the document language models are derived from images rather than words. for each unannotated image, we obtain an image-specific distribution over words by computing the conditional distribution p(w j r) which is available for the models described in section 3 influence:2 type:1 pair index:850 citer id:682 citer title:modeling annotated data citer abstract:we consider the problem of modeling annotated datadata with multiple types where the instance of one type (such as a caption) serves as a description of the other type (such as an image). we describe three hierarchical probabilistic mixture models which aim to describe such data, culminating in correspondence latent dirichlet allocation, a latent variable model that is e ective at modeling the joint distribution of both types and the conditional distribution of the annotation given the primary type. we conduct experiments on the corel database of images and captions, assessing performance in terms of held-out likelihood, automatic annotation, and text-based image retrieval citee id:321 citee title:normalized cuts and image segmentation citee abstract:we propose a novel approach for solving the perfi ceptual grouping problem in vision rather than fofi cusing on local features and their consistencies in the image data our approach aims at extracting the global impression of an image we treat image segmentafi tion as a graph partitioning problem and propose a novel global criterion the normalized cut for segmentfi ing the graph the normalized cut criterion measures both the total dissimilarity between the dierent groups as well as the total similarity within the groups we show that an ecient computational technique based on a generalized eigenvalue problem can be used to opfi timize this criterion we have applied this approach to segmenting static images and found results very enfi couraging surrounding text:image/caption data our work has focused on images and their captions from the corel database. following previous work [2]<1>, each image is segmented into regions by the n-cuts algorithm [***]<1>. for each region, we compute a set of real-valued features representing visual properties such as size, position, color, texture, and shape influence:3 type:3 pair index:851 citer id:691 citer title:maximum likelihood estimation of dirichlet distributions citer abstract:dirichlet distributions are commonly used as priors over proportional data. 
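the text-based retrieval procedure described in the surrounding text of the record above, ranking unannotated images by how well their image-conditional word distribution explains the query, can be sketched as a query-likelihood ranker; the smoothing constant, the data layout, and the toy example are illustrative assumptions.

# query-likelihood retrieval over per-image word distributions p(w | image).
# smoothing constant and data layout are illustrative assumptions.
import math

def rank_images(query_words, image_word_dists, epsilon=1e-6):
    """image_word_dists: {image_id: {word: probability}}.
    returns image ids sorted by descending query log-likelihood."""
    scores = {}
    for image_id, p_w in image_word_dists.items():
        scores[image_id] = sum(math.log(p_w.get(w, epsilon))
                               for w in query_words)
    return sorted(scores, key=scores.get, reverse=True)

images = {"img1": {"tiger": 0.4, "grass": 0.3, "sky": 0.3},
          "img2": {"ocean": 0.5, "sky": 0.5}}
print(rank_images(["tiger", "sky"], images))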
in this paper, i will introduce this distribution, discuss why it is useful, and compare implementations of 4 different methods for estimating its parameters from observed data citee id:512 citee title:estimating a dirichlet distribution citee abstract:the dirichlet distribution and its compound variant, the dirichlet-multinomial, are two of the most basic models for proportional data, such as the mix of vocabulary words in a text document. yet the maximum-likelihood estimate of these distributions is not available in closed-form. this paper describes simple and efficient iterative schemes for obtaining parameter estimates in these models. in each case, a fixed-point iteration and a newton-raphson (or generalized newton-raphson) iteration is provided surrounding text:a fixed point iteration. minka [***]<1> provides a convergent fixed point iteration technique for estimating parameters. the idea behind this is to guess an initial , find a function that bounds f from below which is tight at , then to optimize this function to arrive at a new guess at influence:1 type:1 pair index:852 citer id:691 citer title:maximum likelihood estimation of dirichlet distributions citer abstract:dirichlet distributions are commonly used as priors over proportional data. in this paper, i will introduce this distribution, discuss why it is useful, and compare implementations of 4 different methods for estimating its parameters from observed data citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:conclusions the example that motivated this project comes from latent semantic analysis in text modeling. in a commonly cited model, the latent dirichlet allocation model [***]<1>, a dirichlet prior is incorporated into a generative model for a text corpus, where every multinomial drawn from it represents how a document is mixed in terms of topics. for example, a document might spend 1/3 of its words discussing statistics, 1/2 on numerical methods, and 1/6 on algebraic topology influence:2 type:1,3 pair index:853 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. 
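the fixed-point scheme attributed to minka in the surrounding text above (guess a parameter vector, bound the likelihood from below at that guess, maximize the bound, repeat) is commonly implemented with a digamma-inversion step. the sketch below follows that recipe; the starting point, tolerances, and the synthetic test data are assumptions for illustration.

# minka-style fixed-point iteration for the mle of a dirichlet distribution
# from observed proportion vectors. tolerances and iteration counts are
# illustrative choices.
import numpy as np
from scipy.special import psi, polygamma

def inv_psi(y, n_iter=5):
    """invert the digamma function by newton's method."""
    x = np.where(y >= -2.22, np.exp(y) + 0.5, -1.0 / (y - psi(1.0)))
    for _ in range(n_iter):
        x = x - (psi(x) - y) / polygamma(1, x)
    return x

def dirichlet_mle(P, n_iter=1000, tol=1e-9):
    """P: array of shape (N, K), each row a proportion vector summing to 1."""
    log_p_bar = np.mean(np.log(P), axis=0)        # sufficient statistics
    alpha = np.ones(P.shape[1])                   # simple starting point
    for _ in range(n_iter):
        alpha_new = inv_psi(psi(alpha.sum()) + log_p_bar)
        if np.max(np.abs(alpha_new - alpha)) < tol:
            return alpha_new
        alpha = alpha_new
    return alpha

samples = np.random.default_rng(0).dirichlet([2.0, 5.0, 1.0], size=2000)
print(dirichlet_mle(samples))   # should land near [2, 5, 1]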
our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:678 citee title:its not what you know, its who you know: work in the information age citee abstract:we discuss our ethnographic research on personal social networks in the workplace, arguing that traditional institutional resources are being replaced by resources that workers mine from their own networks. social networks are key sources of labor and information in a rapidly transforming economy characterized by less institutional stability and fewer reliable corporate resources. the personal social network is fast becoming the only sensible alternative to the traditional "org chart" for many everyday transactions in today's economy. surrounding text:6 [artificial intelligence]: learning general terms: algorithms, experimentation keywords: user behavior modeling, personal information management, information dissemination 1. introduction working in the information age, the most important is not what you know, but who you know [***]<3>. a social network, the graph of relationships and interactions within a group of individuals, plays a fundamental role as a medium for the spread of information, ideas, and influence influence:2 type:3 pair index:854 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. 
experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:745 citee title:structure, culture and simmelian ties in entrepreneurial firms citee abstract:this article develops a cultural agreement approach to organizational culture that emphasizes how clusters of individuals reinforce potentially idiosyncratic understandings of many aspects of culture including the structure of network relations. building on recent work concerning simmelian tied dyads (defined as dyads embedded in three-person cliques), the research examines perceptions concerning advice and friendship relations in three entrepreneurial firms. the results support the idea that simmelian tied dyads (relative to dyads in general) reach higher agreement concerning who is tied to whom, and who are embedded together in triads in organizations surrounding text:krackhardt [2]<3> showed that companies with strong informal networks perform five or six times better than those with weak networks, especially on the long-term performance. friend and advice networks drive enterprise operations in a way that, if the real organization structure does not match the informal networks, then a company tends to fail [***]<3>. since max weber first studied modern bureaucracy structures in the 1920s, decades of related social scientific researches have been mainly relying on questionnaires and interviews to understand individuals' thoughts and behaviors for sensing informal networks, however, data collection is time consuming and seldom provides timely, continuous, and dynamic information influence:3 type:3 pair index:855 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. 
two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:370 citee title:contactmap: integrating communication and information through visualizing personal social networks citee abstract:in a few short years, we have witnessed a massive uptake in the use of cell phones, personal digital assistants and hybrid devices that integrate phone, computer and internet services. these devices communicate with one another and with traditional computers. along with the internet, they have transformed our computational environments into communication spaces. the goal of our research is to seamlessly integrate communication with the traditional information functions of computational devices. what could provide an organizing principle for advanced user interfaces that afford information and communication services in a single integrated system? based on our research on communication patterns in the workplace, our answer is models of personal social networks. our research showed that people invest considerable effort in maintaining links with networks of colleagues, acquaintances and friends, and that these networks are a significant organizing principle for work and information. in this article we report a study of workplace communication that informed our development efforts. we describe our evolving software prototype, contactmap, and user experiments with contactmap. surrounding text:personal social network (psn) could provide an organizing principle for advanced user interfaces that offer information management and communication services in a single integrated system. one of the most pronounced examples is the networking study by nardi et al$ [***]<2>, who coined the term intensional networks to describe personal social networks. they presented a visual model of users psn to organize personal communications in terms of a social network of contacts influence:2 type:1 pair index:856 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. 
two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:438 citee title:discovering shared interests among people using graph analysis citee abstract:an important problem faced by users of large networks is how to discover resources of interest, such as data or people. in this paper we focus on locating people with particular interests or expertise. the usual approach is to build interest group lists from explicitly registered data. however, doing so assumes one knows what lists should be built, and who should be included in each list. we present an alternative approach, which can support a more fine-grained and dynamically adaptive notion of shared interests. our approach deduces interests from the history of electronic mail communication, using a set of heuristic graph algorithms. we demonstrate the algorithms by applying them to data collected from 15 sites for two months. using these algorithms we were able to deduce shared-interest lists for people far beyond the data collection sites. the algorithms we present are powerful, and if abused can threaten privacy. we propose guidelines that we believe should underlie the ethical use of these algorithms. we discuss several possible applications that we believe do not threaten privacy, including discovering resources other than people, such as file system data. surrounding text:lately, introducing social network analysis into information mining is becoming an important research area. schwartz and wood [***]<2> mined social relationships from email logs by using a set of heuristic graph algorithms. the referral web project [12]<2> mined a social network from a wide variety of publicly-available online information, and used it to help individuals find experts who could answer their questions based on geographical proximity influence:2 type:2 pair index:857 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement.
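the schwartz-wood approach above deduces shared interests purely from the graph induced by email logs. as a hedged illustration of that general idea (not their specific heuristics), the following python sketch builds a weighted contact graph from a made-up toy log and flags pairs of people who correspond with the same third parties; networkx and the toy names are assumptions introduced here.

# a minimal sketch, assuming a toy email log; not the heuristics of schwartz and wood
import networkx as nx

log = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
       ("carol", "dave"), ("dave", "erin"), ("bob", "dave")]

G = nx.Graph()
for sender, recipient in log:
    if G.has_edge(sender, recipient):
        G[sender][recipient]["weight"] += 1      # repeated contact strengthens the tie
    else:
        G.add_edge(sender, recipient, weight=1)

# people who frequently correspond with the same third parties are candidates
# for a shared-interest group, even if they never email each other directly
for u, v in nx.non_edges(G):
    shared = sorted(nx.common_neighbors(G, u, v))
    if shared:
        print(u, v, "share correspondents:", shared)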
two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:746 citee title:referral web: combining social networks and collaborative filtering citee abstract:this paper appears in the communications of the acm, vol. 40, no. 3, march 1997. numerous studies have shown that one of the most effective channels for dissemination of information and expertise within an organization is its informal network of collaborators, colleagues, and friends (granovetter 1973; kraut 1990; wasserman and galaskiewicz 1994). indeed, the social network is at least as important as the official organizational structure for tasks ranging from immediate, local surrounding text:schwartz and wood [11]<2> mined social relationships from email logs by using a set of heuristic graph algorithms. the referral web project [***]<2> mined a social network from a wide variety of publicly-available online information, and used it to help individuals find experts who could answer their questions based on geographical proximity. flake et al influence:2 type:2 pair index:858 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:747 citee title:self-organization and identification of web communities citee abstract:despite the decentralized and unorganized nature of the web, we show that the web self-organizes such that communities of highly related pages can be efficiently identified based purely on connectivity. surrounding text:flake et al. [***]<3> used graph algorithms to mine communities from the web (defined as sets of sites that have more links to each other than to non-members). tyler et al influence:2 type:2 pair index:859 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications.
a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:506 citee title:email as spectroscopy: automated discovery of community structure within organizations citee abstract:we describe a method for the automatic identification of communities of practice from email logs within an organization. we use a betweenness centrality algorithm that can rapidly find communities within a graph representing information flows. we apply this algorithm to an email corpus of nearly one million messages collected over a two-month span, and show that the method is effective at identifying true communities, both formal and informal, within these scale-free graphs. this approach also enables the identification of leadership roles within the communities. these studies are complemented by a qualitative evaluation of the results in the field. surrounding text:tyler et al. [***]<3> use a betweenness centrality algorithm for the automatic identification of communities of practice from email logs within an organization. the google search engine [15]<2> and kleinberg's hits algorithm of finding hubs and authorities on the web [16]<2> are also based on social network concepts influence:2 type:2 pair index:860 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models.
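the betweenness-centrality method used by tyler et al. repeatedly removes high-betweenness edges to split a communication graph into communities (the girvan-newman procedure). a minimal sketch of that idea with networkx, on a stand-in graph rather than the enron corpus, is:

# a minimal sketch of betweenness-based community splitting; the toy graph is an
# assumption, not the email data analyzed in the cited work
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()                 # stand-in for an email communication graph
communities = next(girvan_newman(G))       # first split: remove high-betweenness edges
print([sorted(c) for c in communities])

# the same centrality score can also rank likely "leaders" inside a community
centrality = nx.betweenness_centrality(G)
print(sorted(centrality, key=centrality.get, reverse=True)[:3])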
for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:748 citee title:the pagerank citation ranking: bringing order to the web citee abstract:the importance of a webpage is an inherently subjective matter, which depends on the readers interests, knowledge and attitudes. but there is still much that can be said objectively about the relative importance of web pages. this paper describes pagerank, a method for rating web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. we compare pagerank to an idealized random websurfer. we show how to efficiently compute pagerank for large numbers of pages. and,we show how to apply pagerank to search and to user navigation. surrounding text:[14]<3> use a betweenness centrality algorithm for the automatic identification of communities of practice from email logs within an organization. the google search engine [***]<2> and kleinberg's hits algorithm of finding hubs and authorities on the web [16]<2> are also based on social network concepts. the success of these approaches, and the discovery of widespread network topologies with nontrivial properties, have led to a recent flurry of research on applying link analysis for information mining influence:2 type:1 pair index:861 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. 
two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:241 citee title:authoritative sources in a hyperlinked environment citee abstract:the network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. we develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the world wide web. the central issue we address within our framework is the distillation of broad search topics, through the discovery of authoritative information sources on such topics. we propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of hub pages that join them together in the link structure. our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis. surrounding text:[14]<3> use a betweenness centrality algorithm for the automatic identification of communities of practice from email logs within an organization. the google search engine [15]<2> and kleinberg's hits algorithm of finding hubs and authorities on the web [***]<2> are also based on social network concepts influence:2 type:1 pair index:862 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement.
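both link-analysis scores mentioned here (pagerank and kleinberg's hubs/authorities) are available off the shelf; a small illustrative sketch on an invented toy graph:

# a small sketch of the two link-analysis scores; the directed edges are made up
import networkx as nx

G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")])

pr = nx.pagerank(G, alpha=0.85)      # stationary importance of each node
hubs, authorities = nx.hits(G)       # mutually reinforcing hub/authority scores

print("pagerank   :", {n: round(s, 3) for n, s in pr.items()})
print("authorities:", {n: round(s, 3) for n, s in authorities.items()})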
two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:749 citee title:models for longitudinal network data citee abstract:this chapter treats statistical methods for network evolution. it is argued that it is most fruitful to consider models where network evolution is represented as the result of many (usually non-observed) small changes occurring between the consecutively observed networks. surrounding text:are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters over time. among the known dynamical social networks in literature, snijders' dynamic actor-oriented social network [***]<2> is one of the most successful algorithms. changes in the network are modeled as the stochastic result of network effects (density, reciprocity, etc influence:2 type:2 pair index:863 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:750 citee title:the link prediction problem for social networks citee abstract:given a snapshot of a social network, can we infer which new interactions among its members are likely to occur in the near future? we formalize this question as the link prediction problem, and develop approaches to link prediction based on measures for analyzing the "proximity" of nodes in a network. experiments on large co-authorship networks suggest that information about future interactions can be extracted from network topology alone, and that fairly subtle measures for detecting surrounding text:dynamics of social networks have attracted many researchers' attention recently. given a snapshot of a social network, [***]<2> tries to infer which new interactions among its members are likely to occur in the near future.
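the link prediction work above scores node pairs by topological "proximity" measures such as common neighbours, jaccard overlap, or adamic-adar. a minimal sketch of computing such scores with networkx, on an invented toy graph:

# a minimal sketch of proximity-based link prediction scores; edges are invented
import networkx as nx

G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])

candidates = list(nx.non_edges(G))
for u, v, score in nx.jaccard_coefficient(G, candidates):
    print(f"jaccard({u},{v}) = {score:.2f}")
for u, v, score in nx.adamic_adar_index(G, candidates):
    print(f"adamic-adar({u},{v}) = {score:.2f}")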
in [20]<2>, kubica et al influence:1 type:2 pair index:864 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:591 citee title:stochastic link and group detection citee abstract:link detection and analysis has long been important in the social sciences and in the government intelligence communityfi a significant effort is focused on the structural and functional analysis of "known" networksfi similarly, the detection of individual links is important but is usually done with techniques that result in "known" linksfi more recently the internet and other sources have led to a flood of circumstantial data that provide probabilistic evidence of linksfi co-occurrence in news surrounding text:given a snapshot of a social network, [19]<2> tries to infer which new interactions among its members are likely to occur in the near future. in [***]<2>, kubica et al. are interested in tracking changes in large-scale data by periodically creating an agglomerative clustering and examining the evolution of clusters over time influence:2 type:2 pair index:865 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. 
we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:169 citee title:probabilistic latent semantic analysis citee abstract:probabilistic latent semantic analysis is a novel statistical technique for the analysis of two-mode and co-occurrence data, which has applications in information retrieval and filtering, natural language processing, machine learning from text, and in related areas. compared to standard latent semantic analysis which stems from linear algebra and performs a singular value decomposition of co-occurrence tables, the proposed method is based on a mixture decomposition derived from a latent class model. this results in a more principled approach which has a solid foundation in statistics. in order to avoid overfitting, we propose a widely applicable generalization of maximum likelihood model fitting by tempered em. our approach yields substantial and consistent improvements over latent semantic analysis in a number of experiments. surrounding text:$p(z_i = j)$ gives the probability of choosing a word from topic $j$ in the current document, which varies across different documents. hofmann [***]<1> introduced the aspect model probabilistic latent semantic analysis (plsa), in which topics are modeled as multinomial distributions over words, and documents are assumed to be generated by the activation of multiple topics. blei et al influence:2 type:1 pair index:866 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis.
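the aspect model referred to above decomposes each document's word distribution into a convex combination of topic-specific multinomials; in the notation of this section,

$$ p(w \mid d) = \sum_{j=1}^{T} p(w \mid z = j)\, p(z = j \mid d), $$

where each factor $p(w \mid z = j)$ is a multinomial over the vocabulary and the mixing weights $p(z = j \mid d)$ vary per document.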
experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:blei et al. [***]<1> proposed latent dirichlet allocation (lda) to address the problems of plsa, whose parameterization was susceptible to overfitting and which did not provide a straightforward way to infer topics for unseen test documents. a distribution over topics is sampled from a dirichlet distribution for each document influence:1 type:1 pair index:867 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:426 citee title:finding scientific topics citee abstract:a first step in identifying the content of a document is determining which topics that document addresses.
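written out, the lda generative process sketched above is, for each document $d$ and word position $n$ (standard notation, with $\phi_j$ the topic-word multinomials and $\alpha$ the dirichlet prior):

$$ \theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}}). $$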
we describe a generative model for documents, introduced by blei, ng, and jordan, in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. we then present a markov chain monte carlo algorithm for inference in this model. we use this algorithm to analyze abstracts from pnas by using bayesian model selection to establish the number of topics. we show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying hot topics by examining temporal dynamics and tagging abstracts to illustrate semantic content. surrounding text:each word is sampled from a multinomial distribution over words specific to the sampled topic. following the notations in [***]<1>, in lda, for d documents containing t topics expressed over w unique words, we can represent $p(w \mid z)$ with a set of t multinomial distributions $\phi$ over the w words, such that $p(w \mid z=j) = \phi_j^{(w)}$, and $p(z)$ with a set of d multinomial distributions $\theta$ over the t topics, such that for a word in document d, $p(z=j) = \theta_j^{(d)}$. recently, the author-topic (at) model [25]<2> extends lda to include authorship information, trying to recognize which part of the document is contributed by which co-author. this is a problem similar to topic detection and tracking [28]<2>. we propose an incremental lda (ilda) algorithm to solve it, in which the number of topics is dynamically updated based on the bayesian model selection principle [***]<1>. the procedures of the algorithm are illustrated as follows: incremental latent dirichlet allocation (ilda) algorithm: input: email streams with timestamp t. output: $\phi_{j,t}^{(w)}$ and $\theta_{j,t}^{(d)}$ for different time periods t. steps: 1) apply lda on a data set with currently observed emails in a time period $t_0$ to generate latent topics $z_j$ and estimate $p(w \mid z_j, t_0) = \phi_{j,t_0}^{(w)}$ and $p(z_j \mid d, t_0) = \theta_{j,t_0}^{(d)}$ by equations (4) and (5). 4.1 topic analysis: in the experiment, we applied bayesian model selection [***]<1> to choose the number of topics. in the enron intra-organization emails, there are 26,178 word-terms involved after we apply stopword removal and stemming. we computed $p(w \mid t)$ for t values of 30, 50, 70, 100, 110, 150 topics and for the experiment chose t = 100, which had the maximum value of $\log p(w \mid t)$ influence:2 type:1 pair index:868 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities.
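the model-selection step above chooses the number of topics t that maximizes log p(w|t). as a rough stand-in for that criterion (the cited work estimates p(w|t) from gibbs samples), the following sketch scores candidate topic counts by perplexity with scikit-learn on a toy corpus; the corpus and the candidate values are assumptions, and in practice a held-out set would be used.

# a rough stand-in for choosing the number of topics; not the gibbs-sampling
# estimate of log p(w|T) used in the cited work
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = ["price of gas and power in california",
          "meeting scheduled to discuss the trading desk",
          "energy market prices rise again",
          "please review the attached trading report"]

X = CountVectorizer(stop_words="english").fit_transform(corpus)

best_t, best_perp = None, float("inf")
for t in (2, 3, 4):                                   # candidate numbers of topics
    lda = LatentDirichletAllocation(n_components=t, random_state=0).fit(X)
    perp = lda.perplexity(X)                          # lower perplexity ~ higher log p(w|T)
    if perp < best_perp:
        best_t, best_perp = t, perp
print("chosen number of topics:", best_t)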
we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:454 citee title:the author-topic model for authors and documents citee abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer abstracts. exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the author-topic model, and demonstrate applications to computing similarity between authors and entropy of author output surrounding text:following the notations in [24]<1>, in lda, for d documents containing t topics expressed over w unique words, we can represent $p(w \mid z)$ with a set of t multinomial distributions $\phi$ over the w words, such that $p(w \mid z=j) = \phi_j^{(w)}$, and $p(z)$ with a set of d multinomial distributions $\theta$ over the t topics, such that for a word in document d, $p(z=j) = \theta_j^{(d)}$. recently, the author-topic (at) model [***]<2> extends lda to include authorship information, trying to recognize which part of the document is contributed by which co-author. in a recent unpublished work, mccallum et al influence:3 type:2 pair index:870 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models.
for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:536 citee title:expertisenet: relational and evolutionary expert modeling citee abstract:we develop a novel user-centric modeling technology, which can dynamically describe and update a person's expertise profile. in an enterprise environment, the technology can enhance employees' collaboration and productivity by assisting in finding experts, training employees, etc. instead of using the traditional search methods, such as the keyword match, we propose to use relational and evolutionary graph models, which we call expertisenet, to describe and find experts. these expertisenets are used for mining, retrieval, and visualization. we conduct experiments by building expertisenets for researchers from a research paper collection. the experiments demonstrate that expertise mining and matching are more efficiently achieved based on the proposed relational and evolutionary graph models. surrounding text:. in our recent paper, we built user models to explicitly describe a persons expertise by a relational and evolutionary graph representation called expertisetnet [***]<1>. in this paper, we continue exploring this thread, and build a communitynet model which incorporates these three components together for data mining and knowledge management. a big challenge of automatically building evolutionary personal social network is the evolutionary segmentation, which is to detect changes between personal social network cohesive sections. here we apply the same algorithm as we proposed in [***]<1>. for each personal social network in one time period t, we use the exponential random graph model [17]<1> to estimate an underlying distribution to describe the social network influence:1 type:2 pair index:870 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. 
experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:751 citee title:on-line new event detection and tracking citee abstract:in this work, we discuss and evaluate solutions to text classification problems associated with the events that are reported in on-line source of news. we present solutions to three related classification problems: new event detection, event clustering, and event tracking. the primary focus of this thesis is new event detection, where the goal is to identify news stories that have not previously reported, in a stream of broadcast news comprising radio, television, and newswire. we present an algorithm for new event detection, and analyze the effects of incorporating domain properties into the classification algorithm. we explore a solution that models the temporal relationship between news stories, and investigate the use of proper noun phrase extraction to capture the who, what, when, and where contained in news. our results for new event detection suggest that previous approaches to document clustering provide a good basis for an approach to new event detection, and that further improvements to classification accuracy are obtained when the domain properties of broadcast news are modeled. new event detection is related to the problem of event clustering, where the goal is to group stories that discuss the same event. we investigate on-line clustering as an approach to new event detection, and re-evaluate existing cluster comparison strategies previously used for document retrieval. our results suggest that these strategies produce different groupings of events, and that the on-line single-link strategy extended with a model for domain properties is faster and more effective than other approaches. in this dissertation, we explore several test representation issues in the context of event tracking, where a classifier for an event is formulated from one or more sample stories. the classifier is used to monitor the subsequent news stream for documents related to the event. surrounding text:in order to model the ptcn, one challenge is how to detect latent topics dynamically and at the same time track the emails related to the old topics. this is a problem similar to topic detection tracking [***]<2>. we propose an incremental lda (ilda) algorithm to solve it, in which the number of topics is dynamically updated based on the bayesian model selection principle [24]<1> influence:3 type:3 pair index:871 citer id:744 citer title:modeling and predicting personal information dissemination behavior citer abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. 
our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed citee id:246 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation surrounding text:we can see that it reaches a peak from the end of year 2000 to the beginning of year 2001. from the timeline of enron [***]<3>, we found that the california energy crisis occurred in exactly this time period. among the key people related to this topic, jeff dasovich was an enron government relations executive influence:3 type:3 pair index:872 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncovers some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:381 citee title:correlated topic models citee abstract:topic models, such as latent dirichlet allocation (lda), can be useful tools for the statistical analysis of document collections and other discrete data. the lda model assumes that the words of each document arise from a mixture of topics, each of which is a distribution over the vocabulary. a limitation of lda is the inability to model topic correlation even though, for example, a document about genetics is more likely to also be about disease than x-ray astronomy. this limitation stems from the use of the dirichlet distribution to model the variability among the topic proportions.
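the "techniques based on correlation coefficients" in the abstract above are memory-based collaborative filtering: a user's unknown rating is predicted from the mean-centred ratings of correlated neighbours. a minimal python sketch on an invented ratings matrix:

# a minimal sketch of correlation-based (memory-based) collaborative filtering;
# the ratings matrix is a toy assumption
import numpy as np

# rows = users, columns = items, 0 = unrated
R = np.array([[5, 4, 0, 1],
              [4, 5, 5, 1],
              [1, 2, 4, 5],
              [2, 1, 5, 4]], dtype=float)

def user_mean(u):
    rated = R[u] > 0
    return R[u, rated].mean()

def predict(user, item):
    num, den = 0.0, 0.0
    for other in range(R.shape[0]):
        if other == user or R[other, item] == 0:
            continue
        common = (R[user] > 0) & (R[other] > 0)          # items rated by both users
        if common.sum() < 2:
            continue
        w = np.corrcoef(R[user, common], R[other, common])[0, 1]   # pearson correlation
        if np.isnan(w):
            continue
        num += w * (R[other, item] - user_mean(other))   # deviation from the neighbour's mean
        den += abs(w)
    return user_mean(user) + num / den if den else None

print(round(predict(user=0, item=2), 2))   # predicted rating of user 0 for item 2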
in this paper we develop the correlated topic model (ctm), where the topic proportions exhibit correlation via the logistic normal distribution . we derive a mean-field variational inference algorithm for approximate posterior inference in this model, which is complicated by the fact that the logistic normal is not conjugate to the multinomial. the ctm gives a better fit than lda on a collection of ocred articles from the journal science. furthermore, the ctm provides a natural way of visualizing and exploring this and other unstructured data sets surrounding text:given a document collection, the lda learns its underlying topics in an unsupervised fashion. in the recent past, several extensions to this model have been proposed such as the hierarchical dirichlet processes [12]<3> model that automatically discovers the number of topics, hidden markov model-lda [5]<3> that integrates topic modeling with syntax, correlated topic models [***]<3> that model pairwise correlations between topics, etc. all the aforementioned models ignore an important factor that reveals a huge amount of information contained in large document collections - time influence:3 type:2 pair index:873 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncovers some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:several probabilistic graphical models have been proposed recently, to address this problem. one of the first probabilistic and truly generative models among them is latent dirichlet allocation (lda) [***]<1>. lda models a topic as a multinomial distribution over the vocabulary influence:1 type:1 pair index:874 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. 
in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncovers some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:452 citee title:dynamic topic models citee abstract:a family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. the approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. variational approximations based on kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. in addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. the models are demonstrated by analyzing the ocred archives of the journal science from 1880 through 2000 surrounding text:this permits us to analyze the popularity of various topics as a function of time. another proposed model called the dynamic topic models (dtm) [***]<2> takes a slightly different approach. the dtm explicitly models the evolution of topics with time by estimating the topic distribution at various epochs. 1 analysis of science we analyzed a subset of 30,000 articles from science, 250 from each of the 120 years between 1883 and 2002. this is essentially the same data used by blei and lafferty in their experiments with the dtm [***]<2>. we divided the data into 16 chunks, each consisting of 1875 documents. this term converts the probability of a string to the probability of the corresponding counts vector allowing us to directly compare the perplexities of both the models. hence the perplexity numbers we show in the plots for lda may not directly correspond to the values obtained by previous authors [2, ***]<2>. for our experiments, we split the data time wise into 8 chunks each spanning 15 years and comprising 3750 documents as done in section 4. notwithstanding this fact, the multiscale model is still very useful since it allows us to visualize data better, through the multiscale analysis as we have shown earlier. finally, we note that the dtm [***]<2>, on the contrary, reported slightly lower perplexity than lda. unlike the dtm which uses full bayesian estimation, we use map estimation to keep the the algorithm simple. the new approach, based on non-homogeneous poisson processes, combined with multi-scale haar wavelet analysis is a more natural way to do sequence modeling of counts-data than previous approaches. the new model offers us the best features of both the tot [13]<2> and dtm [***]<2> models. 
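the perplexity figure used in these comparisons is the standard held-out measure: for test documents with word vectors $\mathbf{w}_d$ and lengths $N_d$,

$$ \mathrm{perplexity} = \exp\!\left(-\,\frac{\sum_d \log p(\mathbf{w}_d)}{\sum_d N_d}\right), $$

so lower perplexity means better predictive fit.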
while tot models the probability of occurrence of a topic with time, dtm models the evolution of topic content influence:1 type:2 pair index:875 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:576 citee title:gap: a factor model for discrete data citee abstract:we present a probabilistic model for a document corpus that combines many of the desirable features of previous models. the model is called gap for gamma-poisson, the distributions of the first and last random variable. gap is a factor model, that is, it gives an approximate factorization of the document-term matrix into a product of matrices Λ and x. these factors have strictly non-negative terms. gap is a generative probabilistic model that assigns finite probabilities to documents in a corpus. it can be computed with an efficient and simple em recurrence. for a suitable choice of parameters, the gap factorization maximizes independence between the factors. so it can be used as an independent-component algorithm adapted to document data. the form of the gap model is empirically as well as analytically motivated. it gives very accurate results as a probabilistic model (measured via perplexity) and as a retrieval model. the gap model projects documents and terms into a low-dimensional space of themes, and models texts as passages of terms on the same theme. surrounding text:the latter is considered a strong ir baseline to date. in the area of text modeling, the gap model [***]<1> proposed by canny uses a combination of gamma and poisson distributions to discover latent topics or themes in document collections. the gamma distribution is used to generate the topic weights vector x in each document, which the author calls theme lengths influence:2 type:2 pair index:876 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics.
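as an aside on the gap model described in the surrounding text above (gamma-distributed theme weights driving poisson word counts), the following is a minimal generative sketch; the dimensions, hyperparameters, and variable names are illustrative assumptions, not values taken from the cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K = 1000, 20  # vocabulary size and number of themes (illustrative values)
# per-theme word probability columns, analogous to the non-negative factor matrix
Lambda = rng.dirichlet(np.ones(V), size=K).T   # shape (V, K)

def generate_document(a=2.0, b=0.5, length_scale=50.0):
    """Sketch of a GaP-style generative step: gamma theme weights, then Poisson counts."""
    # theme "lengths" x_k drawn from a gamma distribution, scaled to set expected length
    x = rng.gamma(shape=a, scale=b, size=K) * length_scale / K
    # expected count of each word is a linear combination of the themes
    mean_counts = Lambda @ x
    # observed word counts are conditionally independent Poisson draws
    counts = rng.poisson(mean_counts)
    return x, counts

x, counts = generate_document()
print(counts.sum(), "tokens generated;", (counts > 0).sum(), "distinct words")
```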
the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:384 citee title:integrating topics and syntax citee abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively surrounding text:given a document collection, the lda learns its underlying topics in an unsupervised fashion. in the recent past, several extensions to this model have been proposed such as the hierarchical dirichlet processes [12]<3> model that automatically discovers the number of topics, hidden markov model-lda [***]<3> that integrates topic modeling with syntax, correlated topic models [1]<3> that model pairwise correlations between topics, etc. all the aforementioned models ignore an important factor that reveals a huge amount of information contained in large document collections - time influence:3 type:2 pair index:877 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncovers some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:149 citee title:a probabilistic approach to automatic keyword indexing citee abstract:confirms previously published research in concluding that specialty words tend to possess frequency distributions which cannot be described by a single poisson distribution surrounding text:pastwork the poisson distribution, being a natural model for countsdata, has been considered as a potential candidate to model text in the past. one of the earliest models is the 2-poisson model for information retrieval [***]<3>, which generates words from a mixture of two classes called elite and non-elite classes. this model did not achieve empirical success, mainly owing to the lack of good estimation techniques, but inspired a heuristic model called bm25 [11]<3> influence:3 type:2 pair index:878 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. 
the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:229 citee title:an introduction to variational methods for graphical models citee abstract:this paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (bayesian networks and markov random fields). we present a number of examples of graphical models, including the qmr-dt database, the sigmoid belief network, the boltzmann machine, and several variants of hidden markov models, in which it is infeasible to run exact inference algorithms. we then introduce variational methods, which exploit laws of large numbers to transform the original graphical model into a simplified graphical model in which inference is efficient. inference in the simplified model provides bounds on probabilities of interest in the original model. we describe a general framework for generating variational transformations based on convex duality. finally we return to the examples and demonstrate how variational algorithms can be formulated in each case. surrounding text:p(n, ·, φ) = {∫ p(n, ·, ·) d(·)} p(φ) = p(n, ·) p(φ) (8). 3.3 variational em: since estimating the parameters of the model is intractable, we use variational em to estimate the parameters of the model [***]<1>. we only summarize the results below but the interested reader may refer to appendix a for more details influence:2 type:1 pair index:879 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:278 citee title:bayesian multiscale models for poisson processes citee abstract:the focus of this article is a new class of models, bayesian multi scale models (bmsm's), for phenomena with two defining characteristics: (1) the observed data may be well modeled as a discrete signal (time series) of independent poisson counts, and (2) the underlying intensity function potentially has structure at multiple scales. fundamental to the analysis of such data is the estimation of the underlying intensity function, and it is this problem in particular that is addressed. knowledge of the intensity may be a goal in itself or may serve as an initial step prior to higher-level analyses (e.g., detection or classification).
surrounding text:in our work, we use the poisson distribution to model word counts not only because it is a natural choice for counts data, but also because it is amenable to sequence modeling through bayesian multiscale analysis. bayesian multiscale models for poisson processes were first introduced by kolaczyk [***]<1> and were applied to model physical phenomena such as gamma ray bursts. nowak extended multiscale analysis to build multiscale hidden markov models and applied it to the problem of image segmentation [9]<2> influence:2 type:1 pair index:880 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:762 citee title:multiscale hidden markov models for bayesian image analysis citee abstract:bayesian multiscale image analysis weds the powerful modeling framework of probabilistic graphs with the intuitively appealing and computationally tractable multiresolution paradigm. in addition to providing a very natural and useful framework for modeling and processing images, bayesian multiscale analysis is often much less computationally demanding compared to classical markov random field models. this chapter focuses on a probabilistic graph model called the multiscale hidden markov model surrounding text:bayesian multiscale models for poisson processes were first introduced by kolaczyk [8]<1> and were applied to model physical phenomena such as gamma ray bursts. nowak extended multiscale analysis to build multiscale hidden markov models and applied it to the problem of image segmentation [***]<2>. nowak and kolaczyk also presented multiscale analysis for the poisson inverse problem [10]<2>, which is the problem of estimating latent poisson means based on observed poisson data, whose means are related to the latent poisson means by a known linear function influence:1 type:2 pair index:881 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics.
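to make the multiscale poisson idea referenced above concrete (kolaczyk's bayesian multiscale models and the haar-style analysis that mttm builds on), here is a small sketch of the standard count-splitting construction, in which a parent count is divided between its two children by a binomial draw; this is an illustrative sketch of the general technique, not the mttm estimation algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

def multiscale_counts(total, split_probs):
    """Generate counts on a dyadic grid of 2**depth leaves.

    At each internal node the count is split between left and right children
    with a Binomial draw; split_probs[level][node] is the left-child probability.
    This is the canonical multiscale factorization of a Poisson count vector.
    """
    counts = np.array([total])
    for probs in split_probs:
        left = rng.binomial(counts, probs)          # left children
        right = counts - left                       # right children get the remainder
        counts = np.stack([left, right], axis=1).ravel()
    return counts

depth = 3                                   # 8 leaf epochs (illustrative)
total = rng.poisson(400.0)                  # total count of one word across all epochs
split_probs = [rng.beta(2, 2, size=2**lv) for lv in range(depth)]  # per-node split parameters
leaf_counts = multiscale_counts(total, split_probs)
print("per-epoch counts:", leaf_counts, "sum =", leaf_counts.sum())
```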
the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:182 citee title:a statistical multiscale framework for poisson inverse problems citee abstract:this paper describes a statistical modeling and analysis method for linear inverse problems involving poisson data based on a novel multiscale framework. the framework itself is founded upon a multiscale analysis associated with recursive partitioning of the underlying intensity, a corresponding multiscale factorization of the likelihood (induced by this analysis), and a choice of prior probability distribution made to match this factorization by modeling the "splits" in the underlying surrounding text:nowak extended multiscale analysis to build multiscale hidden markov models and applied it to the problem of image segmentation [9]<2>. nowak and kolaczyk also presented multiscale analysis for the poisson inverse problem [***]<2>, which is the problem of estimating latent poisson means based on observed poisson data, whose means are related to the latent poisson means by a known linear function. in this paper, we cast the problem of topic discovery in document collections as a poisson inverse problem. unlike in the work of nowak and kolaczyk [***]<2>, we do not assume that the linear relationship between the latent poisson parameters and observed poissons is known, which makes the problem slightly more complex. hence, we use variational approximations to estimate the parameters of the model influence:1 type:2 pair index:882 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:764 citee title:some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval citee abstract:the 2-poisson model for term frequencies is used to suggest ways of incorporating certain variables in probabilistic models for information retrieval. the variables concerned are within-document term frequency, document length, and within-query term frequency. simple weighting functions are developed, and tested on the trec test collection. considerable performance improvements (over simple inverse collection frequency weighting) are demonstrated. surrounding text:one of the earliest models is the 2-poisson model for information retrieval [6]<3>, which generates words from a mixture of two classes called elite and non-elite classes. this model did not achieve empirical success, mainly owing to the lack of good estimation techniques, but inspired a heuristic model called bm25 [***]<3>.
the latter is considered a strong ir baseline till date influence:3 type:3 pair index:883 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncovers some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:437 citee title:hierarchical dirichlet processes citee abstract:we consider problems involving groups of data, where each observation within a group is a draw from a mixture model, and where it is desirable to share mixture components between groups. we assume that the number of mixture components is unknown a priori and is to be inferred from the data. in this setting it is natural to consider sets of dirichlet processes, one for each group, where the well-known clustering property of the dirichlet process provides a nonparametric prior for the number of mixture components within each group. given our desire to tie the mixture models in the various groups, we consider a hierarchical model, speci.cally one in which the base measure for the child dirichlet processes is itself distributed according to a dirichlet process. such a base measure being discrete, the child dirichlet processes necessar-ily share atoms. thus, as desired, the mixture models in the different groups necessarily share mixture components. we discuss representations of hierarchical dirichlet processes in terms of a stick-breaking process, and a generalization of the chinese restaurant process that we refer to as the chinese restaurant franchise. we present markov chain monte carlo algorithms for posterior inference in hierarchical dirichlet process mixtures, and describe applications to problems in information retrieval and text modelling surrounding text:given a document collection, the lda learns its underlying topics in an unsupervised fashion. in the recent past, several extensions to this model have been proposed such as the hierarchical dirichlet processes [***]<3> model that automatically discovers the number of topics, hidden markov model-lda [5]<3> that integrates topic modeling with syntax, correlated topic models [1]<3> that model pairwise correlations between topics, etc. all the aforementioned models ignore an important factor that reveals a huge amount of information contained in large document collections - time influence:3 type:1 pair index:884 citer id:763 citer title:multiscale topic tomography citer abstract:modeling the evolution of topics with time is of great value in automatic summarization and analysis of large document collections. in this work, we propose a new probabilistic graphical model to address this issue. the new model, which we call the multiscale topic tomography model (mttm), employs non-homogeneous poisson processes to model generation of word-counts. 
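for reference, the bm25 ranking function mentioned in the 2-poisson discussion a few records above can be sketched as follows; k1 and b are the usual free parameters, and the smoothed idf shown is one common variant rather than the exact formula of the cited work.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Minimal BM25 sketch: score one tokenized document against a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N       # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))   # smoothed idf
        f = tf[term]                                       # within-document term frequency
        norm = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_terms) / avgdl))
        score += idf * norm
    return score

corpus = [["poisson", "model", "retrieval"], ["topic", "model", "text"], ["haar", "wavelet"]]
print(bm25_score(["poisson", "model"], corpus[0], corpus))
```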
the evolution of topics is modeled through a multi-scale analysis using haar wavelets. one of the new features of the model is its modeling the evolution of topics at various time-scales of resolution, allowing the user to zoom in and out of the time-scales. our experiments on science data using the new model uncover some interesting patterns in topics. the new model is also comparable to lda in predicting unseen data as demonstrated by our perplexity experiments citee id:765 citee title:topics over time: a non-markov continuous-time model of topical trends citee abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends surrounding text:several models have been proposed in the recent past to address this issue. one of the models, called topics over time (tot) [***]<2> associates a beta distribution over time to each topic that represents the occurrence probability of that topic at any given time. the model learns the parameters of this distribution for each topic based on the time-stamps of documents associated with that topic in the collection. this statistic is proportional to the occurrence frequency of a topic in a given epoch. we normalized this statistic, so that one can interpret the plot as the probability of occurrence of a topic as a function of time, similar to the plots in [***]<2>. [table 2: documents in which the topic genetics appears with the highest probability in each of the epochs: 1893, a space-relation of numbers; 1911, genotype and pure line; 1922, spermatogenesis of the garter snake; 1941, the artificial synthesis of a 42-chromosome wheat; 1949, cytological evidence opposing the theory of brachymeiosis in the ascomycetes; 1965, bipolarity of information transfer from the salmonella typhimurium chromosome; 1979, distribution of rna transcripts from structural and intervening sequences of the ovalbumin gene; 2000, dna replication fork pause sites dependent on transcription] [figure 10: occurrence rate of "reaction" in three different topics (particle physics, blood tests, chemistry, and total count); y-axis: poisson rate (log scale), x-axis: year] the word reaction could have several meanings depending on the context in which it is used. the new approach, based on non-homogeneous poisson processes, combined with multi-scale haar wavelet analysis is a more natural way to do sequence modeling of counts-data than previous approaches. the new model offers us the best features of both the tot [***]<2> and dtm [3]<2> models.
while tot models the probability of occurrence of a topic with time, dtm models the evolution of topic content influence:2 type:2 pair index:885 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the documents timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics?occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:164 citee title:latent dirichlet allocation citee abstract:we describe latent dirichlet allocation (lda), a generative probabilistic model for collections of discrete data such as text corpora. lda is a three-level hierarchical bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. in the context of text modeling, the topic probabilities provide an explicit representation of a document. we present efficient approximate inference techniques based on variational methods and an em algorithm for empirical bayes parameter estimation. we report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic lsi model surrounding text:introduction research in statistical models of co-occurrence has led to the development of a variety of useful topic modelsmechanisms for discovering low-dimensional, multi-faceted summaries of documents or other discrete data. these include models of words alone, such as latent dirichlet allocation (lda) [***, 4]<1>, of words and research paper citations [3]<3>, of word sequences with markov dependencies [5]<3>, of words and their authors [11]<3>, of words in a social network of senders and recipients [9]<3>, and of words and relations (such as voting patterns) [15]<3>. in each case, graphical model structures are carefully-designed to capture the relevant structure and cooccurrence dependencies in the data. our notation is summarized in table 1, and the graphical model representations of both lda and our tot models are shown in figure 1. latent dirichlet allocation (lda) is a bayesian network that generates a document using a mixture of topics [***]<1>. 
in its generative process, for each document d, a multinomial distribution θ over topics is randomly sampled from a dirichlet with parameter α, and then to generate each word, a topic zdi is chosen from this topic distribution, and a word, wdi, is generated by randomly sampling from a topic-specific multinomial distribution φzdi influence:1 type:1 pair index:886 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time.
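the generative process described just above (lda, plus tot's per-topic beta distribution over normalized timestamps) can be written as a short simulation; the symbols follow the notation quoted later in this section (θ, φ, ψ), and all sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

T, V, D, N_d = 5, 200, 3, 30          # topics, vocabulary, documents, tokens per doc
alpha, beta_prior = 0.1, 0.01
phi = rng.dirichlet(np.full(V, beta_prior), size=T)   # phi_z: word multinomial per topic
psi = rng.uniform(1, 5, size=(T, 2))                  # psi_z: beta(a, b) over timestamps per topic (TOT)

docs = []
for d in range(D):
    theta_d = rng.dirichlet(np.full(T, alpha))        # theta_d ~ Dirichlet(alpha)
    words, times = [], []
    for _ in range(N_d):
        z = rng.choice(T, p=theta_d)                  # z_di ~ Multinomial(theta_d)
        w = rng.choice(V, p=phi[z])                   # w_di ~ Multinomial(phi_z)
        t = rng.beta(psi[z, 0], psi[z, 1])            # t_di ~ Beta(psi_z), the TOT extension
        words.append(w)
        times.append(t)
    docs.append((words, times))

print("first document's first 10 (word, timestamp) pairs:", list(zip(*docs[0]))[:10])
```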
unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the documents timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics?occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:426 citee title:finding scientific topics citee abstract:a first step in identifying the content of a document is determining which topics that document addresses. we describe a generative model for documents, introduced by blei, ng, and jordan , in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. we then present a markov chain monte carlo algorithm for inference in this model. we use this algorithm to analyze s from pnas by using bayesian model selection to establish the number of topics. we show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying hot topics by examining temporal dynamics and tagging abstracts to illustrate semantic content. surrounding text:introduction research in statistical models of co-occurrence has led to the development of a variety of useful topic modelsmechanisms for discovering low-dimensional, multi-faceted summaries of documents or other discrete data. these include models of words alone, such as latent dirichlet allocation (lda) [2, ***]<1>, of words and research paper citations [3]<3>, of word sequences with markov dependencies [5]<3>, of words and their authors [11]<3>, of words in a social network of senders and recipients [9]<3>, and of words and relations (such as voting patterns) [15]<3>. in each case, graphical model structures are carefully-designed to capture the relevant structure and cooccurrence dependencies in the data influence:2 type:2,1 pair index:888 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the documents timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics?occurrence and correlations change significantly over time. 
we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:384 citee title:integrating topics and syntax citee abstract:statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. we present a generative model that uses both kinds of dependencies, and is capable of simultaneously finding syntactic classes and semantic topics despite having no knowledge of syntax or semantics beyond statistical dependency. this model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively surrounding text:introduction research in statistical models of co-occurrence has led to the development of a variety of useful topic modelsmechanisms for discovering low-dimensional, multi-faceted summaries of documents or other discrete data. these include models of words alone, such as latent dirichlet allocation (lda) [2, 4]<1>, of words and research paper citations [3]<3>, of word sequences with markov dependencies [***]<3>, of words and their authors [11]<3>, of words in a social network of senders and recipients [9]<3>, and of words and relations (such as voting patterns) [15]<3>. in each case, graphical model structures are carefully-designed to capture the relevant structure and cooccurrence dependencies in the data influence:3 type:2 pair index:889 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the documents timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics?occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:291 citee title:bursty and hierarchical structure in streams citee abstract:a fundamental problem in text data mining is to extract meaningful structure from document streams that arrive continuously over time. e-mail and news articles are two natural examples of such streams, each characterized by topics that appear, grow in intensity for a period of time, and then fade away. the published literature in a particular research field can be seen to exhibit similar phenomena over a much longer time scale. 
underlying much of the text mining work in this area is the following intuitive premise --- that the appearance of a topic in a document stream is signaled by a "burst of activity," with certain features rising sharply in frequency as the topic emerges. the goal of the present work is to develop a formal approach for modeling such "bursts," in such a way that they can be robustly and efficiently identified, and can provide an organizational framework for analyzing the underlying content. the approach is based on modeling the stream using an infinite-state automaton, in which bursts appear naturally as state transitions; in some ways, it can be viewed as drawing an analogy with models from queueing theory for bursty network traffic. the resulting algorithms are highly efficient, and yield a nested representation of the set of bursts that imposes a hierarchical structure on the overall stream. experiments with e-mail and research paper archives suggest that the resulting structures have a natural meaning in terms of the content that gave rise to them. surrounding text:another markov model that aims to find word patterns in time is kleinberg's burst of activity model [***]<2>. this approach uses a probabilistic infinite-state automaton with a particular state structure in which high activity states are reachable only by passing through lower activity states influence:2 type:2 pair index:890 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:89 citee title:a generalized probability density function for double-bounded random processes citee abstract:the author developed in 1976 the sinepower probability density function (sp-pdf) to fit up random processes which are bounded at the lower and upper ends, and which has a mode occurring between these two bounds. this latter condition is now relaxed and a generalized pdf entitled double bounded probability density function (db-pdf) is developed here. methods for the application to practical problems of parameter estimation and to computer simulation of random variables are explained by a numerical example.
surrounding text:all the results in this paper employ the beta distribution (which can take versatile shapes), for which the time range of the data used for parameter estimation is normalized to a range from 0 to 1. another possible choice of bounded distributions is the kumaraswamy distribution [***]<3>. double-bounded distributions [figure 1: three topic models, lda and two perspectives on tot: (a) lda, (b) tot model, (c) tot model, alternate view for gibbs sampling] [table 1: notation used in this paper. t: number of topics; d: number of documents; v: number of unique words; nd: number of word tokens in document d; θd: the multinomial distribution of topics specific to the document d; φz: the multinomial distribution of words specific to topic z; ψz: the beta distribution of time specific to topic z; zdi: the topic associated with the ith token in the document d; wdi: the ith token in document d; tdi: the timestamp associated with the ith token in the document d (in figure 1(c))] are appropriate because the training data are bounded in time influence:3 type:1 pair index:891 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:753 citee title:modeling word burstiness using the dirichlet distribution citee abstract:multinomial distributions are often used to model text documents. however, they do not capture well the phenomenon that words in a document tend to appear in bursts: if a word appears once, it is more likely to appear again. in this paper, we propose the dirichlet compound multinomial model (dcm) as an alternative to the multinomial. the dcm model has one additional degree of freedom, which allows it to capture burstiness. we show experimentally that the dcm is substantially better than the multinomial at modeling text data, measured by perplexity. we also show using three standard document collections that the dcm leads to better classification than the multinomial model. dcm performance is comparable to that obtained with multiple heuristic changes to the multinomial model. surrounding text:although, in the above generative process, a timestamp is generated for each word token, all the timestamps of the words in a document are observed as the same as the timestamp of the document. one might also be interested in capturing burstiness, and a solution such as the dirichlet compound multinomial model (dcm) can be easily integrated into the tot model [***]<1>.
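the surrounding text above notes that tot places a beta distribution over timestamps normalized to [0, 1]; a simple way to refit such a per-topic beta from the timestamps currently assigned to a topic is the method-of-moments estimate sketched below (the cited paper uses a moment-based update during sampling, but the exact update there may differ from this simplified version).

```python
import numpy as np

def fit_beta_moments(timestamps):
    """Method-of-moments estimate of Beta(a, b) from timestamps in (0, 1)."""
    t = np.asarray(timestamps, dtype=float)
    m, v = t.mean(), t.var()
    v = max(v, 1e-6)                       # guard against zero variance
    common = m * (1 - m) / v - 1.0
    a, b = m * common, (1 - m) * common
    return max(a, 1e-3), max(b, 1e-3)      # keep parameters positive

# timestamps of tokens currently assigned to one topic, normalized to [0, 1]
rng = np.random.default_rng(3)
t_z = rng.beta(8, 2, size=500)             # a "late-peaking" topic, for illustration
print("estimated (a, b):", fit_beta_moments(t_z))
```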
in our experiments there are a fixed number of topics, t influence:2 type:1 pair index:892 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:592 citee title:topic and role discovery in social networks citee abstract:previous work in social network analysis (sna) has modeled the existence of links from one entity to another, but not the language content or topics on those links. we present the author-recipient-topic (art) model for social network analysis, which learns topic distributions based on the direction-sensitive messages sent between entities. the model builds on latent dirichlet allocation (lda) and the author-topic (at) model, adding the key attribute that distribution over topics is conditioned distinctly on both the sender and recipient, steering the discovery of topics according to the relationships between people. we give surrounding text:introduction: research in statistical models of co-occurrence has led to the development of a variety of useful topic models: mechanisms for discovering low-dimensional, multi-faceted summaries of documents or other discrete data. these include models of words alone, such as latent dirichlet allocation (lda) [2, 4]<1>, of words and research paper citations [3]<3>, of word sequences with markov dependencies [5]<3>, of words and their authors [11]<3>, of words in a social network of senders and recipients [***]<3>, and of words and relations (such as voting patterns) [15]<3>. in each case, graphical model structures are carefully-designed to capture the relevant structure and co-occurrence dependencies in the data influence:3 type:2 pair index:893 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time.
we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:373 citee title:continuous time bayesian networks citee abstract:in this paper we present a language for finite state continuous time bayesian networks (ctbns), which describe structured stochastic processes that evolve over continuous time. the state of the system is decomposed into a set of local variables whose values change over time. the dynamics of the system are described by specifying the behavior of each local variable as a function of its parents in a directed (possibly cyclic) graph. the model specifies, at any given point in time, the distribution over two aspects: when a local variable changes its value and the next value it takes. these distributions are determined by the variable's current value and the current values of its parents in the graph. more formally, each variable is modelled as a finite state continuous time markov process whose transition intensities are functions of its parents. we present a probabilistic semantics for the language in terms of the generative model a ctbn defines over sequences of events. we list types of queries one might ask of a ctbn, discuss the conceptual and computational difficulties associated with exact inference, and provide an algorithm for approximate inference which takes advantage of the structure within the process surrounding text:(or word associations) of a topic changes over time. the continuous time bayesian network (ctbn) [***]<2> is an example of using continuous time without discretization. a ctbn consists of two components: a bayesian network and a continuous transition model, which avoids various granularity problem due to discretization influence:1 type:2 pair index:894 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the documents timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics?occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:454 citee title:the author-topic model for authors and documents citee abstract:we introduce the author-topic model, a generative model for documents that extends latent dirichlet allocation (lda; blei, ng, & jordan, 2003) to include authorship information. each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. a document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors. we apply the model to a collection of 1,700 nips conference papers and 160,000 citeseer s. 
exact inference is intractable for these datasets and we use gibbs sampling to estimate the topic and author distributions. we compare the performance with two other generative models for documents, which are special cases of the author-topic model: lda (a topic model) and a simple author model in which each author is associated with a distribution over words rather than a distribution over topics. we show topics recovered by the authortopic model, and demonstrate applications to computing similarity between authors and entropy of author output surrounding text:introduction research in statistical models of co-occurrence has led to the development of a variety of useful topic modelsmechanisms for discovering low-dimensional, multi-faceted summaries of documents or other discrete data. these include models of words alone, such as latent dirichlet allocation (lda) [2, 4]<1>, of words and research paper citations [3]<3>, of word sequences with markov dependencies [5]<3>, of words and their authors [***]<3>, of words in a social network of senders and recipients [9]<3>, and of words and relations (such as voting patterns) [15]<3>. in each case, graphical model structures are carefully-designed to capture the relevant structure and cooccurrence dependencies in the data influence:3 type:2 pair index:895 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the documents timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics?occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:448 citee title:dynamic social network analysis using latent space models citee abstract:this paper explores two aspects of social network modeling. first, we generalize a successful static model of relationships into a dynamic model that accounts for friendships drifting over time. second, we show how to make it tractable to learn such models from data, even as the number of entities n gets large. the generalized model associates each entity with a point in p-dimensional euclidean latent space. the points can move as time progresses but large moves in latent space are improbable. observed links between entities are more likely if the entities are close in latent space. we show how to make such a model tractable (sub-quadratic in the number of entities) by the use of appropriate kernel functions for similarity in latent space; the use of low dimensional kd-trees; a new efficient dynamic adaptation of multidimensional scaling for a first pass of approximate projection of entities into latent space; and an efficient conjugate gradient update rule for non-linear local optimization in which amortized time per entity during an update is o(log n). 
we use both synthetic and real-world data on up to 11,000 entities which indicate near-linear scaling in computation time and improved performance over four alternative approaches. we also illustrate the system operating on twelve years of nips co-authorship data surrounding text:hidden markov models and kalman filters are two such examples. for instance, recent work in social network analysis [***]<2> proposes a dynamic model that accounts for friendships drifting over time. blei and lafferty present a version of their ctm in which the alignment among topics across time steps is modeled by a kalman filter on the gaussian distribution in the logistic normal distribution [1]<2> influence:2 type:2 pair index:896 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the documents timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics?occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:744 citee title:modeling and predicting personal information dissemination behavior citee abstract:in this paper, we propose a new way to automatically model and predict human behavior of receiving and disseminating information by analyzing the contact and content of personal communications. a personal profile, called communitynet, is established for each individual based on a novel algorithm incorporating contact, content, and time information simultaneously. it can be used for personal social capital management. clusters of communitynets provide a view of informal networks for organization management. our new algorithm is developed based on the combination of dynamic algorithms in the social network field and the semantic content classification methods in the natural language processing and machine learning literatures. we tested communitynets on the enron email corpus and report experimental results including filtering, prediction, and recommendation capabilities. we show that the personal behavior and intention are somewhat predictable based on these models. for instance, "to whom a person is going to send a specific email" can be predicted by ones personal social network and content analysis. experimental results show the prediction accuracy of the proposed adaptive algorithm is 58% better than the social network-based predictions, and is 75% better than an aggregated model based on latent dirichlet allocation with social network enhancement. two online demo systems we developed that allow interactive exploration of communitynet are also discussed surrounding text:each segment was fit with the gt model, and trends were compared. one difficulty with this approach is that aligning the topics from each time slice can be difficult, although starting gibbs sampling using parameters from the previous time slice can help, as shown in [***]<2>. 
somewhat similarly, the timemines system [14]<2> for some tdt tasks (single topic in each document) tries to construct overview timelines of a set of news stories influence:3 type:2 pair index:897 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:862 citee title:timemines: constructing timelines with statistical models of word usage citee abstract:we present a system, timemines, that automatically generates timelines from date-tagged free text corpora. timemines detects, ranks, and groups semantic features based on their statistical properties. we use these features to discover sets of related stories that deal with a single topic. surrounding text:one difficulty with this approach is that aligning the topics from each time slice can be difficult, although starting gibbs sampling using parameters from the previous time slice can help, as shown in [13]<2>. somewhat similarly, the timemines system [***]<2> for some tdt tasks (single topic in each document) tries to construct overview timelines of a set of news stories. a χ2 test is performed to identify days on which the number of occurrences of named entities or noun phrases produces a statistic above a given threshold influence:3 type:2 pair index:898 citer id:765 citer title:topics over time: a non-markov continuous-time model of topical trends citer abstract:this paper presents an lda-style topic model that captures not only the low-dimensional structure of data, but also how the structure changes over time. unlike other recent work that relies on markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps, and for each generated document, the mixture distribution over topics is influenced by both word co-occurrences and the document's timestamp. thus, the meaning of a particular topic can be relied upon as constant, but the topics' occurrence and correlations change significantly over time. we present results on nine months of personal email, 17 years of nips research papers and over 200 years of presidential state-of-the-union addresses, showing improved topics, better timestamp prediction, and interpretable trends citee id:586 citee title:group and topic discovery from relations and text citee abstract:we present a probabilistic generative model of entity relationships and textual attributes that simultaneously discovers groups among the entities and topics among the corresponding text. block-models of relationship data have been studied in social network analysis for some time.
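as an editorial aside on the χ2 test mentioned in the timemines record above: a minimal sketch of the 2x2 chi-squared statistic such a system could use to flag a feature as unusually frequent on one day versus the rest of the corpus. the counts and threshold below are made up and are not taken from the cited paper.

# illustrative sketch: chi-squared statistic on a 2x2 contingency table.
def chi_squared_2x2(on_day, off_day, day_total, rest_total):
    """on_day/off_day: occurrences of the feature on the day / elsewhere;
    day_total/rest_total: all feature occurrences on the day / elsewhere."""
    a, b = on_day, day_total - on_day
    c, d = off_day, rest_total - off_day
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# flag the feature if the statistic exceeds a threshold (e.g. 3.84 for p < 0.05)
print(chi_squared_2x2(on_day=40, off_day=60, day_total=500, rest_total=20000) > 3.84)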
here we simultaneously cluster in several modalities at once, incorporating the words associated with certain relationships. significantly, joint inference allows the discovery of groups to be guided by the emerging topics, and vice versa. we present experimental results on two large data sets: sixteen years of bills put before the u.s. senate, comprising their corresponding text and voting records, and 43 years of similar data from the united nations. we show that in comparison with traditional, separate latent-variable models for words or block-structures for votes, the group-topic model's joint inference improves both the groups and topics discovered. surrounding text:introduction research in statistical models of co-occurrence has led to the development of a variety of useful topic models: mechanisms for discovering low-dimensional, multi-faceted summaries of documents or other discrete data. these include models of words alone, such as latent dirichlet allocation (lda) [2, 4]<1>, of words and research paper citations [3]<3>, of word sequences with markov dependencies [5]<3>, of words and their authors [11]<3>, of words in a social network of senders and recipients [9]<3>, and of words and relations (such as voting patterns) [***]<3>. in each case, graphical model structures are carefully designed to capture the relevant structure and co-occurrence dependencies in the data influence:3 type:2 pair index:899 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:778 citee title:yahoo! for amazon: sentiment parsing from small talk on the web citee abstract:the internet has made it feasible to tap a continuous stream of public sentiment from the world wide web, quite literally permitting one to "feel the pulse" of any issue under consideration. we present a methodology for real time sentiment extraction in the domain of finance. with the advent of the web, there has been a sharp increase in the influence of individuals on the stock market via web-based trading and the posting of sentiment to stock message boards. while it is important to capture this "sentiment" of small investors, as yet, no index of sentiment has been compiled. this paper comprises (a) a technology for extracting small investor sentiment from web sources to create an index, and (b) illustrative applications of the methodology.
we make use of computerized natural language and statistical algorithms for the automated classification of messages posted on the web. we design a suite of classification algorithms, each of different theoretical content, with a view to characterizing the sentiment of any single posting to a message board. the use of multiple methods allows imposition of voting rules in the classification process. it also enables elimination of "fuzzy" messages which are better off uninterpreted. a majority rule across algorithms vastly improves classification accuracy, but also leads to a natural increase in the number of messages classified as "fuzzy". the classifier achieves an accuracy of 62% (versus a random classification accuracy of 33%), and compares favorably against human agreement on message classification, which was 72%. the technology is computationally efficient, allowing the access and interpretations of thousands of messages within minutes. our illustrative applications show evidence of a strong link between market movements and sentiment. based on approximately 25,000 messages for the last quarter of 2000, we found evidence that sentiment is based on stock movements. surrounding text:3. 2 recommendations at a more applied level, das and chen [***]<2> used a classifier on investor bulletin boards to see if apparently positive postings were correlated with stock price. several scoring methods were employed in conjunction with a manually crafted lexicon, but the best performance came from a combination of techniques influence:3 type:2 pair index:900 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:281 citee title:beyond word n-grams citee abstract:we describe, analyze, and evaluate experimentally a new probabilistic model for word-sequence prediction in natural language based on prediction suffix trees (psts). by using efficient data structures, we extend the notion of pst to unbounded vocabularies. we also show how to use a bayesian approach based on recursive priors over all possible psts to efficiently maintain tree mixtures. these mixtures have provably and practically better performance than almost any single model. we evaluate the model on several corpora. the low perplexity achieved by relatively small pst mixture models suggests that they may be an advantageous alternative, both theoretically and practically, to the widely used n-gram models. 
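as an editorial aside on the classifier-voting scheme described in the das-and-chen record above (several classifiers vote, and messages without a clear majority are left as "fuzzy"): a minimal sketch under the assumption of three made-up label values; the classifiers themselves are stand-ins, and none of this is code from the cited work.

# illustrative sketch of majority voting with a "fuzzy" abstain.
from collections import Counter

def vote(labels, min_agreement=2):
    """labels: one label per classifier, e.g. 'buy', 'sell', or 'neutral'."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else "fuzzy"

print(vote(["buy", "buy", "neutral"]))   # -> buy
print(vote(["buy", "sell", "neutral"]))  # -> fuzzy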
surrounding text:to improve performance, we drew on church's suffix array algorithm [27]<3>. future work might incorporate techniques from probabilistic suffix trees [***]<3>. [table excerpt: scoring method / test 1 / test 2; trigram baseline 88] influence:3 type:3 pair index:901 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:584 citee title:genre classification and domain transfer for information filtering citee abstract:the world wide web is a vast repository of information, but the sheer volume makes it difficult to identify useful documents. we identify document genre as an important factor in retrieving useful documents and focus on the novel document genre dimension of subjectivity. we investigate three approaches to automatically classifying documents by genre: traditional bag of words techniques, part-of-speech statistics, and hand-crafted shallow linguistic features. we are particularly interested in domain transfer: how well the learned classifiers generalize from the training corpus to a new document corpus. our experiments demonstrate that the part-of-speech approach is better than traditional bag of words techniques, particularly in the domain transfer conditions. surrounding text:1 objectivity classification the task of separating reviews from other types of content is a genre or style classification problem. it involves identifying subjectivity, which finn et al. [***]<2> attempted to do on a set of articles spidered from the web. a classifier based on the relative frequency of each part of speech in a document outperformed bag-of-words and custom-built features influence:2 type:3 pair index:902 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning.
when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:585 citee title:good-turing smoothing without tears citee abstract:the performance of statistically based techniques for many tasks such as spelling correction, sense disambiguation, and translation is improved if one can estimate a probability for an object of interest which has not been seen before. good-turing methods are one means of estimating these probabilities for previously unseen objects. however, the use of good-turing methods requires a smoothing step which must smooth in regions of vastly different accuracy. such smoothers are difficult to use surrounding text:because some values are unusually low or high in our sample, pre-smoothing is required. we utilized the simple good-turing code from sampson where the values are smoothed with a log-linear curve [***]<3>. we also used add-one smoothing so that all frequencies were non-zero influence:3 type:3 pair index:903 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for text processing applications such as information extraction and information retrieval. we study the effects of dynamic adjectives, semantically oriented adjectives, and gradable adjectives on a simple subjectivity classifier, and establish that they are strong predictors of subjectivity. a novel trainable method that statistically combines two indicators of gradability is presented and evaluated, complementing existing automatic techniques for assigning orientation labels. surrounding text:hatzivassiloglou and mckeown [5]<2> used textual conjunctions such as fair and legitimate or simplistic but well-received to separate similarly- and oppositely-connoted words. other studies showed that restricting features used for classification to those adjectives that come through as strongly dynamic, gradable, or oriented improved performance in the genre-classification task [***, 24]<2>.
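as an editorial aside on the good-turing smoothing mentioned in the record above: the basic good-turing adjusted count is r* = (r + 1) * N_{r+1} / N_r, where N_r is the number of distinct items seen exactly r times. in practice (the "simple good-turing" code referred to above) the N_r values are first smoothed with a log-linear fit; the toy sketch below skips that step and uses raw counts, so it is only an illustration.

# illustrative sketch of the basic good-turing adjusted count.
from collections import Counter

def good_turing_adjusted_counts(frequencies):
    n_r = Counter(frequencies)   # N_r: how many items occur exactly r times
    return {r: (r + 1) * n_r.get(r + 1, 0) / n_r[r] for r in n_r}

# token frequencies from a toy sample: three items seen once, two seen twice, one seen three times
print(good_turing_adjusted_counts([1, 1, 1, 2, 2, 3]))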
turney and littman [23]<2> determined the similarity between two words by counting the number of results returned by web searches joining the words with a near operator influence:2 type:3 pair index:904 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:417 citee title:direction-based text interpretation as an information access refinement citee abstract:a text-based intelligent system should provide more in-depth information about the contents of its corpus than does a standard information retrieval system, while at the same time avoiding the complexity and resource-consuming behavior of detailed text understanders. instead of focusing on discovering documents that pertain to some topic of interest to the user, an approach is introduced based on the criterion of directionality (e.g., is the agent in favor of, neutral, or opposed to the event?). a method is described for coercing sentence meanings into a metaphoric model such that the only semantic interpretation needed in order to determine the directionality of a sentence is done with respect to the model. this interpretation method is designed to be an integrated component of a hybrid information access system. surrounding text:is the agent in favor of, neutral or opposed to the event. ) which converts linguistic pieces into roles in a metaphoric model of motion, with labels like block or enable [***]<2>. recently, liu et al influence:3 type:3 pair index:905 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. 
but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:270 citee title:automatic retrieval and clustering of similar words citee abstract:bootstrapping semantics from text is one of the greatest challenges in natural language learning. earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words. we present an approach that goes a step further by obtaining a tree structure among the most similar words so that different senses of a given word can be identified with different subtrees. surrounding text:this area of work is related to the general problem of word clustering. lin [***]<3> and pereira et al. [16]<3> used linguistic collocations to group words with similar uses or meanings influence:3 type:3 pair index:906 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:142 citee title:a model of textual affect sensing using real-world knowledge citee abstract:this paper presents a novel way for assessing the affective qualities of natural language and a scenario for its use. previous approaches to textual affect sensing have employed keyword spotting, lexical affinity, statistical methods, and hand-crafted models. this paper demonstrates a new approach, using large-scale real-world knowledge about the inherent affective nature of everyday situations (such as "getting into a car accident") to classify sentences into "basic" emotion categories. this surrounding text:recently, liu et al. [***]<2> used relationships from the open mind commonsense database and manually-specified ground truth to assign scalar affect values to linguistic features. these corresponded to six basic emotions (happy, sad, anger, fear, disgust, surprise) influence:3 type:3 pair index:907 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.)
we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:288 citee title:bow: a toolkit for statistical language modeling, text retrieval, classification and clustering citee abstract:bow (or libbow) is a library of c code useful for writing statistical text analysis, language modeling and information retrieval programs. the current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow). surrounding text:6 scoring after selecting a set of features and optionally smoothing their probabilities, we must assign them scores, used to place test documents in the set of positive reviews or negative reviews. we tried some machine-learning techniques using the rainbow text-classification package [***]<3>, but table 9 shows the performance was no better than our method. we also tried svmlight, the package used by pang et al influence:3 type:3 pair index:908 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:562 citee title:feature subset selection in text-learning citee abstract:this paper describes several known and some new methods for feature subset selection on large text data. experimental comparison given on real-world data collected from web users shows that characteristics of the problem domain and machine learning algorithm should be considered when a feature scoring measure is selected. our problem domain consists of hyperlinks given in the form of small documents represented with word vectors. in our learning experiments naive bayesian classifier was used on... surrounding text:this performs better on test 2 but does worse as a result of the skew in test 1. a similar measure, the odds ratio [***]<3> is calculated as .
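the odds-ratio formula itself is dropped from the excerpt above. the usual odds-ratio feature score from the feature-selection literature is sketched below; whether the cited work used exactly this variant cannot be read off the excerpt, so treat the definition as an assumption.

# illustrative sketch of the standard odds-ratio feature score.
import math

def odds_ratio(p_w_pos, p_w_neg, eps=1e-9):
    """log of [ p(w|pos) * (1 - p(w|neg)) ] / [ (1 - p(w|pos)) * p(w|neg) ]."""
    num = p_w_pos * (1.0 - p_w_neg) + eps
    den = (1.0 - p_w_pos) * p_w_neg + eps
    return math.log(num / den)

print(odds_ratio(p_w_pos=0.30, p_w_neg=0.05))  # strongly positive feature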
influence:3 type:3 pair index:909 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:287 citee title:book recommending using text categorization with extracted information citee abstract:content-based recommender systems suggest documents, items, and services to users based on learning a profile of the user from rated examples containing information about the given items. text categorization methods are very useful for this task but generally rely on unstructured text. we have developed a book-recommending system that utilizes semi-structured information about items gathered from the web using simple information extraction techniques. initial experimental results demonstrate surrounding text:mooney et al. [***]<2> faced a similar problem when trying to use amazon review information to train book recommendation tools. they used three variations: calculating an expected value from naive bayes output, reducing the classification problem to a binary problem, and weighting binary ratings based on the extremity of the original score influence:3 type:3 pair index:910 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:738 citee title:mining product reputations on the web citee abstract:knowing the reputations of your own and/or competitors' products is important for marketing and customer relationship management. it is, however, very costly to collect and analyze survey data manually.
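as an editorial aside on the first variation in the mooney et al. excerpt above ("calculating an expected value from naive bayes output"): a minimal sketch of turning per-rating class probabilities into a single expected rating. the probabilities below are placeholders, not output of any real model, and the exact formulation in the cited work may differ.

# illustrative sketch of an expected rating from class probabilities.
def expected_rating(class_probs):
    """class_probs: mapping rating value -> p(rating | document)."""
    total = sum(class_probs.values())
    return sum(r * p for r, p in class_probs.items()) / total

print(expected_rating({1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.25}))  # about 3.7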
this paper presents a new framework for mining product reputations on the internet. it automatically collects people's opinions about target products from web pages, and uses text mining techniques to obtain reputations of the products. in advance, we generate, on the basis of human-tested surrounding text:com [8, 22]<3>. the second is from nec japan [***]<3>. finally, there is much relevant work in the general area of information extraction and pattern-finding for classification influence:2 type:2,3 pair index:911 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:recently, pang et al. [***]<2> attempted to classify movie reviews posted to usenet, using accompanying numerical ratings as ground truth. a variety of features and learning methods were employed, but the best results came from unigrams in a presence-based frequency model run through a support vector machine (svm), with 82 influence:2 type:2 pair index:912 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. 
the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:444 citee title:distributional clustering of english words citee abstract:we describe and experimentally evaluate a method for automatically clustering words according to their distribution in particular syntactic contexts. words are represented by the relative frequency distributions of contexts in which they appear, and relative entropy is used to measure the dissimilarity of those distributions. clusters are represented by "typical" context distributions averaged from the given words according to their probabilities of cluster membership, and in many cases can surrounding text:lin [9]<3> and pereira et al. [***]<3> used linguistic collocations to group words with similar uses or meanings. influence:3 type:3 pair index:913 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:697 citee title:learning surface text patterns for a question answering system citee abstract:in this paper we explore the power of surface text patterns for open-domain question answering systems. in order to obtain an optimal set of patterns, we have developed a method for learning such patterns automatically. a tagged corpus is built from the internet in a bootstrapping process by providing a few hand-crafted examples of each question type to altavista. patterns are then automatically extracted from the returned documents and standardized. we calculate the precision of each pattern, and the average precision for each question type. these patterns are then applied to find answers to new questions. using the trec-10 question set, we report results for two cases: answers determined from the trec-10 corpus and from the web. surrounding text:for example, i called nikon and i called kodak would ideally be grouped together as i called x. substitutions have been used to solve these sorts of problems in subjectivity identification, text classification, and question answering [***, 19, 26]<3>, but, as indicated in table 3, they are mostly ineffective for our task.
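as an editorial aside on the relative entropy used in the distributional-clustering record above to compare the context distributions of two words: a minimal sketch of the kl divergence between two toy context distributions. the distributions are invented for the example.

# illustrative sketch of relative entropy (kl divergence) between context distributions.
import math

def kl_divergence(p, q, eps=1e-12):
    """sum over contexts c of p(c) * log(p(c) / q(c))."""
    contexts = set(p) | set(q)
    return sum(p.get(c, 0.0) * math.log((p.get(c, 0.0) + eps) / (q.get(c, eps) + eps))
               for c in contexts if p.get(c, 0.0) > 0.0)

ctx_nikon = {"bought": 0.5, "called": 0.3, "returned": 0.2}
ctx_kodak = {"bought": 0.4, "called": 0.4, "returned": 0.2}
print(kl_divergence(ctx_nikon, ctx_kodak))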
we begin by replacing any numerical tokens with number influence:3 type:3 pair index:914 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:272 citee title:automatically generating extraction patterns from untagged text citee abstract:many corpus-based natural language processing systems rely on text corpora that have been manually annotated with syntactic or semantic tags. in particular, all previous dictionary construction systems for information extraction have used an annotated training corpus or some form of annotated input. we have developed a system called autoslog-ts that creates dictionaries of extraction patterns using only untagged text. autoslog-ts is based on the autoslog system, which generated extraction surrounding text:finally, there is much relevant work in the general area of information extraction and pattern-finding for classification. one technique uses linguistic units called relevancy signatures as part of circus, a tool for sorting documents [***]<3>. 3. for example, i called nikon and i called kodak would ideally be grouped together as i called x. substitutions have been used to solve these sorts of problems in subjectivity identification, text classification, and question answering [18, ***, 26]<3>, but, as indicated in table 3, they are mostly ineffective for our task. we begin by replacing any numerical tokens with number influence:2 type:3 pair index:915 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. 
but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:201 citee title:affect analysis of text using fuzzy semantic typing citee abstract:we propose a novel, convenient fusion of natural-language processing and fuzzy logic techniques for analyzing affect content in free text; our main goals are fast analysis and visualization of affect content for decision-making. the primary linguistic resource for fuzzy semantic typing is the fuzzy affect lexicon, from which other important resources are generated, notably the fuzzy thesaurus and affect category groups. free text is tagged with affect categories from the lexicon, and the affect surrounding text:1 affect and direction using fuzzy logic was one interesting approach to classifying sentiment. subasic and huettner [***]<2> manually constructed a lexicon associating words with affect categories, specifying an intensity (strength of affect level) and centrality (degree of relatedness to the category). for example, mayhem would belong, among others, to the category violence with certain levels of intensity and centrality influence:3 type:3 pair index:916 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:779 citee title:phoaks: a system for sharing recommendations citee abstract:finding relevant, high-quality information on theworld-wide web is a difficult problem. phoaks (people helping one another know stuff) is an experimental system that addresses this problem through a collaborative filtering approach. phoaks works by automatically recognizing, tallying, and redistributing recommendations of web resources mined from usenet news messages. surrounding text:several scoring methods were employed in conjunction with a manually crafted lexicon, but the best performance came from a combination of techniques. another project, using usenet as a corpus, managed to accurately determine when posters were recommending a url in their message [***]<2>. recently, pang et al influence:3 type:3 pair index:917 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) 
and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:780 citee title:unsupervised learning of semantic orientation from a hundred-billion-word corpus citee abstract:the evaluative character of a word is called its semantic orientation. a positive semantic orientation implies desirability (e.g., "honest", "intrepid") and a negative semantic orientation implies undesirability (e.g., "disturbing", "superfluous"). this paper introduces a simple algorithm for unsupervised learning of semantic orientation from extremely large corpora. the method involves issuing queries to a web search engine and using pointwise mutual information to analyse the results. the algorithm is empirically evaluated using a training corpus of approximately one hundred billion words -- the subset of the web that is indexed by the chosen search engine. tested with 3,596 words (1,614 positive and 1,982 negative), the algorithm attains an accuracy of 80%. the 3,596 test words include adjectives, adverbs, nouns, and verbs. the accuracy is comparable with the results achieved by hatzivassiloglou and mckeown (1997), using a complex four-stage supervised learning algorithm that is restricted to determining the semantic orientation of adjectives surrounding text:other studies showed that restricting features used for classification to those adjectives that come through as strongly dynamic, gradable, or oriented improved performance in the genre-classification task [6, 24]<2>. turney and littman [***]<2> determined the similarity between two words by counting the number of results returned by web searches joining the words with a near operator. the relationship between an unknown word and a set of manually-selected seeds was used to place it into a positive or negative subjectivity class influence:3 type:2,3 pair index:918 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. 
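as an editorial aside on the pmi-based semantic orientation described in the turney-littman record above (orientation of a word is its pointwise mutual information with positive seed words minus its pointwise mutual information with negative seed words, estimated from co-occurrence counts): a minimal sketch. the hits() function is a hypothetical stand-in for whatever count source is used (a near-operator search query in the cited work), and the count table below is made up.

# illustrative sketch of seed-based semantic orientation from co-occurrence counts.
import math

POS_SEEDS = ["good", "nice", "excellent"]
NEG_SEEDS = ["bad", "poor", "terrible"]

def semantic_orientation(word, hits, eps=0.01):
    pos = sum(hits(word, s) for s in POS_SEEDS) + eps
    neg = sum(hits(word, s) for s in NEG_SEEDS) + eps
    pos_base = sum(hits(s) for s in POS_SEEDS) + eps
    neg_base = sum(hits(s) for s in NEG_SEEDS) + eps
    return math.log2((pos * neg_base) / (neg * pos_base))

# toy count table standing in for near-operator hit counts
counts = {("sturdy", "good"): 120, ("sturdy", "excellent"): 40, ("sturdy", "bad"): 10,
          ("good",): 10000, ("nice",): 8000, ("excellent",): 5000,
          ("bad",): 9000, ("poor",): 7000, ("terrible",): 4000}
hits = lambda *words: counts.get(tuple(words), 0)
print(semantic_orientation("sturdy", hits))  # positive orientation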
but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:694 citee title:learning subjective adjectives from corpora citee abstract:subjectivity tagging is distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. there are numerous applications for which subjectivity tagging is relevant, including information extraction and information retrieval. this paper identifies strong clues of subjectivity using the results of a method for clustering words according to distributional similarity (lin 1998), seeded by a small amount of detailed manual annotation surrounding text:hatzivassiloglou and mckeown [5]<2> used textual conjunctions such as fair and legitimate or simplistic but well-received to separate similarly- and oppositely-connoted words. other studies showed that restricting features used for classification to those adjectives that come through as strongly dynamic, gradable, or oriented improved performance in the genre-classification task [6, ***]<2>. turney and littman [23]<2> determined the similarity between two words by counting the number of results returned by web searches joining the words with a near operator influence:3 type:3 pair index:919 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:28 citee title:a corpus study of evaluative and speculative language citee abstract:this paper presents a corpus study of evaluative and speculative language. knowledge of such language would be useful in many applications, such as text categorization and summarization. analyses of annotator agreement and of characteristics of subjective language are performed. this study yields knowledge needed to design effective machine learning systems for identifying subjective language surrounding text:wiebe et al. [***]<2> studied manual annotation of subjectivity at the expression, sentence, and document level and showed that not all potentially subjective elements really are, and that readers opinions vary. 2 influence:3 type:3 pair index:920 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. 
ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:625 citee title:identifying collocations for recognizing opinions citee abstract:subjectivity in natural language refers to aspects of language used to express opinions and evaluations surrounding text:for example, i called nikon and i called kodak would ideally be grouped together as i called x. substitutions have been used to solve these sorts of problems in subjectivity identification, text classification, and question answering [18, 19, ***]<3>, but, as indicated in table 3, they are mostly ineffective for our task. we begin by replacing any numerical tokens with number influence:3 type:3 pair index:921 citer id:777 citer title:opinion extraction and semantic classification of product reviews citer abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful citee id:781 citee title:using suffix arrays to compute term frequency and document frequency for all substrings in a corpus citee abstract:bigrams and trigrams are commonly used in statistical natural language processing; this paper will describe techniques for working with much longer n-grams. suffix arrays (manber and myers 1990) were first introduced to compute the frequency and location of a substring (n-gram) in a sequence (corpus) of length n. to compute frequencies over all n(n + 1)/2 substrings in a corpus, the substrings are grouped into a manageable number of equivalence classes. in this way, a prohibitive computation over substrings is reduced to a manageable computation over classes. 
this paper presents both the algorithms and the code that were used to compute term frequency (tf) and document frequency (df) for all n-grams in two large corpora, an english corpus of 50 million words of wall street journal and a japanese corpus of 216 million characters of mainichi shimbun. the second half of the paper uses these frequencies to find "interesting" substrings. lexicographers have been interested in n-grams with high mutual information (mi) where the joint term frequency is higher than what would be expected by chance, assuming that the parts of the n-gram combine independently. residual inverse document frequency (ridf) compares document frequency to another model of chance where terms with a particular term frequency are distributed randomly throughout the collection. mi tends to pick out phrases with noncompositional semantics (which often violate the independence assumption) whereas ridf tends to highlight technical terminology, names, and good keywords for information retrieval (which tend to exhibit nonrandom distributions over documents). the combination of both mi and ridf is better than either by itself in a japanese word extraction task. surrounding text:ways of matching testing data to the scored features are discussed later. to improve performance, we drew on church's suffix array algorithm [***]<3>. future work might incorporate techniques from probabilistic suffix trees [2]<3> influence:3 type:3 pair index:922 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly citee id:737 citee title:mining newsgroups using networks arising from social behavior citee abstract:recent advances in information retrieval over hyperlinked corpora have convincingly demonstrated that links carry less noisy information than text. we investigate the feasibility of applying link-based methods in new application domains. the specific application we consider is to partition authors into opposite camps within a given topic in the context of newsgroups. a typical newsgroup posting consists of one or more quoted lines from another posting followed by the opinion of the author.
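as an editorial aside on the residual idf measure described in the suffix-array record above (observed idf minus the idf expected if the term's occurrences were scattered over documents at random under a poisson model): a minimal sketch with made-up counts.

# illustrative sketch of residual inverse document frequency (ridf).
import math

def ridf(term_freq, doc_freq, num_docs):
    observed_idf = -math.log2(doc_freq / num_docs)
    expected_df_rate = 1.0 - math.exp(-term_freq / num_docs)   # poisson p(at least one occurrence)
    expected_idf = -math.log2(expected_df_rate)
    return observed_idf - expected_idf

# a term that occurs 50 times but is packed into only 5 of 1,000 documents
print(ridf(term_freq=50, doc_freq=5, num_docs=1000))  # high ridf: bursty, keyword-like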
this social behavior gives rise to a network in which the vertices are individuals and the links represent "responded-to" relationships. an interesting characteristic of many newsgroups is that people more frequently respond to a message when they disagree than when they agree. this behavior is in sharp contrast to the www link graph, where linkage is an indicator of agreement or common interest. by analyzing the graph structure of the responses, we are able to effectively classify people into opposite camps. in contrast, methods based on statistical analysis of text yield low accuracy on such datasets because the vocabulary used by the two sides tends to be largely identical, and many newsgroup postings consist of relatively few words of text. surrounding text:they show that the classifiers perform well on whole reviews, but poorly on sentences because a sentence contains much less information. [***]<2> finds that supervised sentiment classification is inaccurate. they proposed a method based on social network for the purpose influence:2 type:2 pair index:923 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:328 citee title:collective information extraction with relational markov networks citee abstract:most information extraction (ie) systems treat separate potential extractions as independent. however, in many cases, considering influences between different potential extractions could improve overall accuracy. statistical methods based on undirected graphical models, such as conditional random fields (crfs), have been shown to be an effective approach to learning accurate ie systems. we present a new ie method that employs relational markov networks (a generalization of crfs), which can represent arbitrary dependencies between extractions. this allows for "collective information extraction" that exploits the mutual influence between possible extractions. experiments on learning to extract protein names from biomedical text demonstrate the advantages of this approach surrounding text:recently, information extraction from texts was studied by several researchers. 
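the newsgroup abstract above rests on one structural observation: a "responded-to" link usually signals disagreement, so the reply graph can be split into two camps by roughly maximising the number of links that cross the split. the sketch below is a hypothetical greedy 2-colouring in that spirit, not the graph-partitioning algorithm used in the cited paper.

```python
from collections import defaultdict

def split_into_camps(replies):
    """replies: (author_a, author_b) pairs meaning a responded to b.
    assuming a reply usually signals disagreement, greedily place each author
    in the camp opposite to most of the people they exchanged replies with."""
    graph = defaultdict(set)
    for a, b in replies:
        graph[a].add(b)
        graph[b].add(a)
    camp = {}
    # high-degree authors are assigned first so they anchor the split
    for author in sorted(graph, key=lambda a: -len(graph[a])):
        votes = {0: 0, 1: 0}
        for other in graph[author]:
            if other in camp:
                votes[1 - camp[other]] += 1  # vote for the camp opposite the neighbour
        camp[author] = 0 if votes[0] >= votes[1] else 1
    return camp

replies = [("alice", "bob"), ("carol", "bob"), ("alice", "dave"), ("carol", "dave")]
print(split_into_camps(replies))  # alice/carol end up opposite bob/dave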
their focus is on using machine learning and nlp methods to extract/classify named entities and relations [***, 10, 14, 20, 31]<3>. our task involves identifying product features which are usually not named entities and can be expressed as nouns, noun phrases, verbs, and adjectives influence:3 type:2 pair index:924 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:[29]<2> examines several supervised machine learning methods for sentiment classification of movie reviews. [***]<2> also experiments a number of learning methods for review classification. they show that the classifiers perform well on whole reviews, but poorly on sentences because a sentence contains much less information influence:1 type:2 pair index:925 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. 
there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly citee id:783 citee title:web-scale information extraction in knowitall (preliminary results) citee abstract:manually querying search engines in order to accumulate a large body of factual information is a tedious, error-prone process of piecemeal search. search engines retrieve and rank potentially relevant documents for human perusal, but do not extract facts, assess confidence, or fuse information from multiple documents. this paper introduces knowitall, a system that aims to automate the tedious process of extracting large collections of facts from the web in an autonomous, domain-independent, and scalable manner. the paper describes preliminary experiments in which an instance of knowitall, running for four days on a single machine, was able to automatically extract 54,753 facts. knowitall associates a probability with each fact enabling it to trade off precision and recall. the paper analyzes knowitall's architecture and reports on lessons learned for the design of large-scale information extraction systems surrounding text:recently, information extraction from texts was studied by several researchers. their focus is on using machine learning and nlp methods to extract/classify named entities and relations [5, ***, 14, 20, 31]<3>. our task involves identifying product features which are usually not named entities and can be expressed as nouns, noun phrases, verbs, and adjectives influence:3 type:2 pair index:926 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features.
this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:631 citee title:wordnet: an electronic lexical database citee abstract:wordnet is perhaps the most important and widely used lexical resource for natural language processing systems up to now. wordnet: an electronic lexical database, edited by christiane fellbaum, discusses the design of wordnet from both theoretical and historical perspectives, provides an up-to-date description of the lexical database, and presents a set of applications of wordnet. the book contains a foreword by george miller, an introduction by christiane fellbaum, seven chapters from the cognitive sciences laboratory of princeton university, where wordnet was produced, and nine chapters contributed by scientists from elsewhere. surrounding text:our current system uses a simple method. the basic idea is to employ wordnet [***]<1> to check if any synonym groups/sets exist among the features. for a given word, it may have more than one sense, i influence:3 type:3 pair index:927 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:656 citee title:information extraction with hmm structures learned by stochastic optimization citee abstract:recent research has demonstrated the strong performance of hidden markov models applied to information extraction -- the task of populating database slots with corresponding phrases from text documents. 
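the surrounding text in this record says the system employs wordnet to check whether synonym groups exist among the extracted features. a minimal sketch of that idea, assuming nltk with the wordnet corpus downloaded; the merge rule used here (two features belong together if any of their senses share a synset) is one simple reading of the description, not necessarily the exact rule in the system.

```python
# pip install nltk; then nltk.download("wordnet") once
from nltk.corpus import wordnet as wn

def are_synonyms(word_a, word_b):
    """true if any sense of word_a shares a wordnet synset with any sense of word_b."""
    synsets_a = set(wn.synsets(word_a))
    return any(s in synsets_a for s in wn.synsets(word_b))

def group_features(features):
    """greedily merge product-feature words that wordnet considers synonymous."""
    groups = []
    for feat in features:
        for group in groups:
            if any(are_synonyms(feat, member) for member in group):
                group.append(feat)
                break
        else:
            groups.append([feat])
    return groups

print(group_features(["picture", "photo", "image", "battery", "price", "cost"]))
# e.g. [['picture', 'photo', 'image'], ['battery'], ['price', 'cost']]
```

since a word may have several senses, this check is deliberately permissive; a production system would also look at which sense is actually used in the reviews.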
a remaining problem, however, is the selection of state-transition structure for the model. this paper demonstrates that extraction accuracy strongly depends on the selection of structure, and presents an algorithm for automatically finding good structures by stochastic optimization surrounding text:recently, information extraction from texts was studied by several researchers. their focus is on using machine learning and nlp methods to extract/classify named entities and relations [5, 10, ***, 20, 31]<3>. our task involves identifying product features which are usually not named entities and can be expressed as nouns, noun phrases, verbs, and adjectives influence:3 type:2 pair index:928 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for text processing applications such as information extraction and information retrieval. we study the effects of dynamic adjectives, semantically oriented adjectives, and gradable adjectives on a simple subjectivity classifier, and establish that they are strong predictors of subjectivity. a novel trainable method that statistically combines two indicators of gradability is presented and evaluated, complementing existing automatic techniques for assigning orientation labels. surrounding text:however, social networks are not applicable to customer reviews. [***]<2> investigates sentence subjectivity classification. a method is proposed to find adjectives that are indicative of positive or negative opinions
first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:417 citee title:direction-based text interpretation as an information access refinement citee abstract:a text-based intelligent system should provide more in-depth information about the contents of its corpus than does a standard information retrieval system, while at the same time avoiding the complexity and resource-consuming behavior of detailed text understanders. instead of focusing on discovering documents that pertain to some topic of interest to the user, an approach is introduced based on the criterion of directionality (e.g., is the agent in favor of, neutral, or opposed to the event?). a method is described for coercing sentence meanings into a metaphoric model such that the only semantic interpretation needed in order to determine the directionality of a sentence is done with respect to the model. this interpretation method is designed to be an integrated component of a hybrid information access system. surrounding text:sentiment classification sentiment classification classifies opinion texts or sentences as positive or negative. work of hearst [***]<2> on classification of entire documents uses models inspired by cognitive linguistics. das and chen [8]<2> use a manually crafted lexicon in conjunction with several scoring methods to classify stock postings influence:3 type:2 pair index:930 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. 
for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:for format (3), we need to identify both product features and opinion orientations. in [***]<2>, we proposed several techniques to perform these tasks for format (3), which are also useful for format (1). in both formats (1) and (3), reviewers typically use full sentences. the method is based on natural language processing and supervised pattern discovery. we show that the techniques in [***]<2> are not suitable for format (2) because of short phrases or incomplete sentences (we call them sentence segments) in pros and cons rather than full sentences. we do not analyze detailed reviews of format (2) as they are elaborations of pros and cons. a new technique is proposed to identify product features from pros and cons of review format (2). existing techniques in [***]<2> are not suitable for this case. our experimental results show that the proposed technique is highly effective. below, we mainly discuss prior work related to analysis of customer reviews or opinions. in [***]<2>, we propose several methods to analyze customer reviews of format (3). they perform the same tasks of identifying product features on which customers have expressed their opinions and determining whether the opinions are positive or negative. 
they perform the same tasks of identifying product features on which customers have expressed their opinions and determining whether the opinions are positive or negative. however, the techniques in [***]<2>, which are primarily based on unsupervised itemset mining, are only suitable for reviews of formats (3) and (1). reviews of these formats usually consist of full sentences. currently we do not use detailed reviews of format (2). although the methods in [***]<2> can be applied to detailed reviews of format (2), analyzing short sentence segments in pros and cons produces more accurate results. in [23]<2>, morinaga et al. compare information of different products in a category through search to find the reputation of the products. however, using noun phrases tends to produce too many non-terms, while using reoccurring phrases misses many low frequency terms, terms with variations, and terms with only one word. as shown in [***]<2>, using the existing terminology finding system fastr [11]<2> produces very poor results. furthermore, using noun phrases is not sufficient for finding product features. another important point to note is that a feature may not be a noun or noun phrase, which is used in [***]<2>. verbs may be features as well, e. the techniques presented in section 3.3 and those in [***]<2> alone are useful in situations where a fast and approximate solution is sufficient. due to space limitations, this interface is not given here. the product features and opinions found by the automatic techniques in [***]<2> are also displayed in the window on the right. if the results generated by automatic techniques are correct, the analyst simply clicks on accept. we also observe that pos tagging makes many mistakes due to the brief segments (incomplete sentences) in pros and cons. columns 4-5 and 8-9 show the recall and precision of the fbs system in [***]<2>. the low recall and precision values indicate that the techniques there are not suitable for pros and cons, which are mostly short phrases or incomplete sentences influence:1 type:2 pair index:931 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison.
experimental results show that the technique is highly effective and outperforms existing methods significantly citee id:345 citee title:conditional random fields: probabilistic models for segmenting and labeling sequence data citee abstract:we present conditional random fields, a framework for building probabilistic models to segment and label sequence data. conditional random fields offer several advantages over hidden markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. conditional random fields also avoid a fundamental limitation of maximum entropy markov models (memms) and other discriminative markov models based on directed graphical models, which can be biased towards states with few successor states. we present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to hmms and memms on synthetic and natural-language data. surrounding text:recently, information extraction from texts was studied by several researchers. their focus is on using machine learning and nlp methods to extract/classify named entities and relations [5, 10, 14, ***, 31]<3>. our task involves identifying product features which are usually not named entities and can be expressed as nouns, noun phrases, verbs, and adjectives influence:3 type:2 pair index:932 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly citee id:658 citee title:integrating classification and association rule mining citee abstract:concept lattice is an efficient tool for data analysis. in this paper we show how classification and association rule mining can be unified under concept lattice framework. we present a fast algorithm to extract association and classification rules from concept lattice surrounding text:the problem of mining association rules is to generate all association rules in d that have support and confidence greater than the user-specified minimum support and minimum confidence. we use the association mining system cba [***]<1> to mine rules.
we use 1% as the minimum support, but do not use minimum confidence here, which will be used later influence:3 type:3 pair index:933 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly citee id:734 citee title:mining data records from web pages citee abstract:a large amount of information on the web is contained in regularly structured objects, which we call data records. such data records are important because they often present the essential information of their host pages, e.g., lists of products or services. it is useful to mine such data records in order to extract information from them to provide value-added services. existing automatic techniques are not satisfactory because of their poor accuracies. in this paper, we propose a more effective technique to perform the task. the technique is based on two observations about data records on the web and a string matching algorithm. the proposed technique is able to mine both contiguous and non-contiguous data records. our experimental results show that the proposed technique outperforms existing techniques substantially. surrounding text:both these approaches are based on the fact that reviews at each site are displayed according to some fixed layout templates. we use the second approach which is provided by our system mdr-2 [37]<1>, which is an improvement of mdr [***]<2>. mdr-2 is able to extract individual data fields in data records. due to space limitations, we will not discuss it further (see [***]<2>[37]<1> for more details).
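the rule-mining step quoted in this record (the cba system, a 1% minimum support, and no minimum confidence at this stage) boils down to finding frequent itemsets over the pros/cons segments. the sketch below is a plain apriori-style pass, not cba itself; the toy segments and the tag token "<NN>" are made-up placeholders for the pos-tagged segments the paper works with.

```python
from collections import Counter

def frequent_itemsets(transactions, min_support=0.01, max_size=3):
    """apriori-style enumeration of itemsets whose support (fraction of
    transactions containing them) is at least min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    frequent = {}
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]) for i, c in counts.items() if c / n >= min_support}
    frequent.update({s: counts[next(iter(s))] / n for s in current})
    size = 2
    while current and size <= max_size:
        # join step: union frequent smaller itemsets into candidates of the next size
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c for c, k in counts.items() if k / n >= min_support}
        frequent.update({c: counts[c] / n for c in current})
        size += 1
    return frequent

# toy "transactions": each pro/con segment reduced to its words and pos tags
segments = [{"battery", "life", "<NN>"}, {"battery", "<NN>"}, {"price", "<NN>"},
            {"battery", "life"}] * 25  # 100 transactions
for itemset, support in sorted(frequent_itemsets(segments).items(), key=lambda x: -x[1]):
    print(set(itemset), round(support, 2))
```

with 100 segments, the 1% threshold keeps any itemset seen at least once; in a real corpus it prunes rare word/tag combinations before language patterns are generated from the surviving rules.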
a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:91 citee title:a hierarchical approach to wrapper induction citee abstract:with the tremendous amount of information that becomes available on the web on a daily basis, the ability to quickly develop information agents has become a crucial problem. a vital component of any web-based information agent is a set of wrappers that can extract the relevant data from semistructured information sources. our novel approach to wrapper induction is based on the idea of hierarchical information extraction, which turns the hard problem of extracting data from an arbitrarily complex document into a series of easier extraction tasks. we introduce an inductive algorithm, stalker, that generates high accuracy extraction rules based on user-labeled training examples. labeling the training data represents the major bottleneck in using wrapper induction techniques, and our experimental results show that stalker does significantly better then other approaches; on one hand, stalker requires up to two orders of magnitude fewer examples than other algorithms, while on the other hand it can handle information sources that could not be wrapped by existing techniques. surrounding text:fortunately, there are existing technologies for this purpose. one approach is wrapper induction [***]<1>. a wrapper induction system allows the user to manually label a set of reviews from each site and the system learns extraction rules from them influence:3 type:3 pair index:935 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. 
for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly citee id:115 citee title:sentiment analysis: capturing favorability using natural language processing citee abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative. the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles. surrounding text:[34]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [***, 27, 30, 35, 36]<2>. our work differs from sentiment and subjectivity classification as they do not identify features commented by customers or what customers praise or complain about
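the favorability abstract above scores sentiment toward a specific subject rather than the whole document. a crude stand-in for its parser-based association step is to count lexicon polarities only near a mention of the subject; the tiny lexicon and the token window below are invented for illustration and are much weaker than the syntactic analysis the cited system uses.

```python
POLARITY = {"great": 1, "excellent": 1, "love": 1,   # toy lexicon (hypothetical)
            "poor": -1, "terrible": -1, "noisy": -1}

def subject_sentiment(text, subject, window=2):
    """sum lexicon polarities of words within `window` tokens of the subject."""
    tokens = text.lower().split()
    score = 0
    for i, tok in enumerate(tokens):
        if tok == subject.lower():
            nearby = tokens[max(0, i - window): i + window + 1]
            score += sum(POLARITY.get(w, 0) for w in nearby)
    return score

review = "The lens is excellent but the autofocus motor is noisy and the flash is poor"
print(subject_sentiment(review, "lens"))   #  1  (only 'excellent' is nearby)
print(subject_sentiment(review, "flash"))  # -1  (only 'poor' is nearby)
```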
the two elementary components of this approach are a shallow nlp polar language extraction system and a machine learning based topic classifier. these components are composed together by making a simple but accurate collocation assumption: if a topical sentence contains polar language, the system predicts that the polar language is reflective of the topic, and not some other subject matter. we evaluate our system, components and assumption on a corpus of online consumer messages. based on these components, we discuss how to measure the overall sentiment about a particular topic as expressed in online messages authored by many different people. we propose to use the fundamentals of bayesian statistics to form an aggregate authorial opinion metric. this metric would propagate uncertainties introduced by the polarity and topic modules to facilitate statistically valid comparisons of opinion across multiple topics. surrounding text:[34]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [26, ***, 30, 35, 36]<2>. our work differs from sentiment and subjectivity classification as they do not identify features commented by customers or what customers praise or complain about influence:3 type:2 pair index:937 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. 
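the "aggregate authorial opinion metric" proposed earlier in this record can be read, loosely, as a beta posterior over the share of positive messages about a topic, with the polarity classifier's confidence folded into the counts. this is a hedged interpretation of the idea of propagating uncertainty from the polarity and topic modules, not the metric actually defined in the paper.

```python
import math

def aggregate_opinion(polarity_probs, prior_pos=1.0, prior_neg=1.0):
    """polarity_probs: per-message p(positive) from a polarity classifier.
    treat each message as a soft positive/negative observation and return the
    posterior mean and an approximate 95% interval for the share of positive
    opinion about the topic (beta posterior, normal approximation)."""
    pos = prior_pos + sum(polarity_probs)
    neg = prior_neg + sum(1.0 - p for p in polarity_probs)
    mean = pos / (pos + neg)
    var = pos * neg / ((pos + neg) ** 2 * (pos + neg + 1.0))
    half_width = 1.96 * math.sqrt(var)
    return mean, (max(0.0, mean - half_width), min(1.0, mean + half_width))

# ten messages about one topic, with the classifier's confidence in "positive"
probs = [0.9, 0.8, 0.85, 0.3, 0.95, 0.7, 0.6, 0.2, 0.9, 0.75]
mean, interval = aggregate_opinion(probs)
print(f"estimated positive share: {mean:.2f}, 95% interval ~ {interval}")
```

the width of the interval is what makes cross-topic comparisons honest: a topic with few, low-confidence messages gets a wide interval rather than a misleading point estimate.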
surrounding text:[33]<2> applies a unsupervised learning technique based on mutual information between document phrases and the words excellent and poor to find indicative words of opinions for classification. [***]<2> examines several supervised machine learning methods for sentiment classification of movie reviews. [9]<2> also experiments a number of learning methods for review classification influence:3 type:2 pair index:938 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:120 citee title:learning extraction patterns for subjective expressions citee abstract:this paper presents a bootstrapping process that learns linguistically rich extraction patterns for subjective (opinionated) expressions. high-precision classifiers label unannotated data to automatically create a large training set, which is then given to an extraction pattern learning algorithm. the learned patterns are then used to identify more subjective sentences. the bootstrapping process learns many subjective patterns and increases recall while maintaining high precision. surrounding text:[34]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [26, 27, ***, 35, 36]<2>. our work differs from sentiment and subjectivity classification as they do not identify features commented by customers or what customers praise or complain about influence:3 type:2 pair index:939 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. 
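the unsupervised technique mentioned at the start of this record (mutual information between a phrase and the seed words excellent and poor) collapses to a log-odds over co-occurrence counts. the sketch assumes hit counts are already available (in the original line of work they come from web search queries); the counts and phrases below are hypothetical stand-ins.

```python
import math

def semantic_orientation(phrase, hits, smoothing=0.01):
    """so(phrase) = pmi(phrase, 'excellent') - pmi(phrase, 'poor'),
    which reduces to a log-ratio of co-occurrence and seed-word counts."""
    num = (hits[(phrase, "excellent")] + smoothing) * (hits["poor"] + smoothing)
    den = (hits[(phrase, "poor")] + smoothing) * (hits["excellent"] + smoothing)
    return math.log2(num / den)

# hypothetical counts: hits[word] = documents containing the word,
# hits[(phrase, word)] = documents containing the phrase near the word
hits = {"excellent": 9_000, "poor": 8_000,
        ("direct deposit", "excellent"): 320, ("direct deposit", "poor"): 40,
        ("virtual monopoly", "excellent"): 20, ("virtual monopoly", "poor"): 450}
for phrase in ["direct deposit", "virtual monopoly"]:
    print(phrase, round(semantic_orientation(phrase, hits), 2))
```

a positive score marks the phrase as opinion-positive, a negative score as opinion-negative; the smoothing term only guards against zero counts.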
the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:296 citee title:classifying semantic relations in bioscience text citee abstract:a crucial step toward the goal of automatic extraction of propositional information from natural language text is the identification of semantic relations between constituents in sentences. we examine the problem of distinguishing among seven relation types that can occur between the entities "treatment" and "disease" in bioscience text, and the problem of identifying such entities. we compare five generative graphical models and a neural network, using lexical, syntactic, and semantic features, finding that the latter help achieve high classification accuracy surrounding text:recently, information extraction from texts was studied by several researchers. their focus is on using machine learning and nlp methods to extract/classify named entities and relations [5, 10, 14, 20, ***]<3>. our task involves identifying product features which are usually not named entities and can be expressed as nouns, noun phrases, verbs, and adjectives influence:3 type:2,3 pair index:940 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. 
experimental results show that the technique is highly effective and outperform existing methods significantly citee id:123 citee title:towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences citee abstract:opinion question answering is a challenging task for natural language processing. in this paper, we discuss a necessary component for an opinion question answering system: separating opinions from fact, at both the document and sentence level. we present a bayesian classifier for discriminating between documents with a preponderance of opinions such as editorials from regular news stories, and describe three unsupervised, statistical techniques for the significantly harder task of detecting opinions at the sentence level. we also present a first model for classifying opinion sentences as positive or negative in terms of the main perspective being expressed in the opinion. results from a large collection of news stories and a human evaluation of 400 sentences are reported, indicating that we achieve very high performance in document classification (upwards of 97% precision and recall), and respectable performance in detecting opinions and classifying them at the sentence level as positive, negative, or neutral (up to 91% accuracy). surrounding text:[34]<2> proposes a similar method for nouns. other related works on sentiment classification and opinions discovery include [26, 27, 30, 35, ***]<2>. our work differs from sentiment and subjectivity classification as they do not identify features commented by customers or what customers praise or complain about influence:3 type:2 pair index:941 citer id:782 citer title:opinion observer analyzing and comparing opinions citer abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly citee id:786 citee title:web data extraction based on partial tree alignment citee abstract:this paper studies the problem of extracting data from a web page that contains several structured data records. the objective is to segment these data records, extract data items/fields from them and put the data in a database table. this problem has been studied by several researchers. 
however, existing methods still have some serious limitations. the first class of methods is based on machine learning, which requires human labeling of many examples from each web site that one is interested in extracting data from. the process is time consuming due to the large number of sites and pages on the web. the second class of algorithms is based on automatic pattern discovery. these methods are either inaccurate or make many assumptions. this paper proposes a new method to perform the task automatically. it consists of two steps, (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. for step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. for step 2, we propose a novel partial alignment technique based on tree matching. partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the rest of the data fields. this approach enables very accurate alignment of multiple data records. experimental results using a large number of web pages from diverse domains show that the proposed two-step technique is able to segment data records, align and extract data from them very accurately. surrounding text:both these approaches are based on the fact that reviews at each site are displayed according to some fixed layout templates. we use the second approach which is provided by our system mdr-2 [***]<1>, which is improvement of mdr [22]<2>. mdr-2 is able to extract individual data fields in data records. mdr-2 is able to extract individual data fields in data records. due to space limitations, we will not discuss it further (see [22]<2>[***]<1> for more details). 348 4 influence:3 type:3 pair index:942 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). 
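the alignment step in the partial-tree-alignment abstract above builds on matching dom trees. below is a hedged sketch of the classic simple tree matching score (the size of the best order-preserving node alignment), which this family of data-record extraction systems uses as a building block; it is not the partial tree alignment algorithm itself.

```python
def simple_tree_matching(a, b):
    """a, b: trees as (tag, [children]). returns the size of the best
    order-preserving node alignment between the two trees."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0
    # dynamic program over the two child sequences, as in sequence alignment
    m = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i in range(1, len(kids_a) + 1):
        for j in range(1, len(kids_b) + 1):
            m[i][j] = max(m[i][j - 1], m[i - 1][j],
                          m[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]))
    return 1 + m[len(kids_a)][len(kids_b)]

# two product "data records" with slightly different structure
rec1 = ("tr", [("td", [("b", [])]), ("td", []), ("td", [("a", [])])])
rec2 = ("tr", [("td", [("b", [])]), ("td", [("a", [])])])
print(simple_tree_matching(rec1, rec2))  # 5: every node of the smaller record finds a match
```

records whose trees score highly against each other are treated as instances of the same template, which is what makes field-level extraction from review pages possible.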
we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:dave et al. [***]<2> also experiment with a number of learning methods for review classification. they show that the classifiers perform well on whole reviews, but poorly on sentences because a sentence contains much less information influence:2 type:2 pair index:943 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:410 citee title:determining the semantic orientation of terms through gloss analysis citee abstract:sentiment classification is a recent subdiscipline of text classification which is concerned not with the topic a document is about, but with the opinion it expresses. it has a rich set of applications, ranging from tracking users' opinions about products or about political candidates as expressed in online forums, to customer relationship management. functional to the extraction of opinions from text is the determination of the orientation of ``subjective'' terms contained in text, i.e. the determination of whether a term that carries opinionated content has a positive or a negative connotation. in this paper we present a new method for determining the orientation of subjective terms. the method is based on the quantitative analysis of the glosses of such terms, i.e. the definitions that these terms are given in on-line dictionaries, and on the use of the resulting term representations for semi-supervised term classification. the method we present outperforms all known methods when tested on the recognized standard benchmarks for this task surrounding text:[26]<2> identified subjective language features, such as low-frequency words, word collocations, adjectives and verbs, from corpora and used them in the sentiment classification. 
esuli and sebastiani [***]<2> classified the orientation of a term based on its dictionary glosses. whitelaw, garg, and argamon [25]<2> used short phrases called appraisal groups subjective language features for sentiment classification influence:3 type:2 pair index:944 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for texl processing applicalions such as information exlractiou and information iclricwd. we study tile elfeels of dymunic adjectives, semantically oriented adjectives, and gradable ad.ieclivcs on a simple subjectivity classiiicr, and establish lhat lhcy arc strong predictors of subjectivity. a novel trainable mclhod thai statistically combines two indicators of gradability is presented and ewlhlalcd, complementing exisling automatic icchniques for assigning orientation labels. surrounding text:hatzivassiloglou and mckeown [5]<2> studied the adjectives as evidence of subjective texts. hatzivassiloglou and wiebe [***]<2> investigated sentence subjectivity classification. a method is proposed to find adjectives that are indicative of positive or negative opinions influence:3 type:2 pair index:945 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. 
our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:ve bayes classifiers with both manual and automatic rules. other opinion identification works include [12][28][***]<2>. in the document-level opinion classification, das and chen [2]<2> use a manually crafted lexicon in conjunction with several scoring methods to extract investor sentiment from stock message boards influence:2 type:2 pair index:946 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. 
our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:626 citee title:identifying comparative sentences in text documents citee abstract:this paper studies the problem of identifying comparative sentences in text documents. the problem is related to but quite different from sentiment/opinion sentence identification or classification. sentiment classification studies the problem of classifying a document or a sentence based on the subjective opinion of the author. an important application area of sentiment/opinion identification is business intelligence as a product manufacturer always wants to know consumers opinions on its products. comparisons on the other hand can be subjective or objective. furthermore, a comparison is not concerned with an object in isolation. instead, it compares the object with others. an example opinion sentence is the sound quality of cd player x is poor. an example comparative sentence is the sound quality of cd player x is not as good as that of cd player y. clearly, these two sentences give different information. their language constructs are quite different too. identifying comparative sentences is also useful in practice because direct comparisons are perhaps one of the most convincing ways of evaluation, which may even be more important than opinions on each individual object. this paper proposes to study the comparative sentence identification problem. it first categorizes comparative sentences into different types, and then presents a novel integrated pattern discovery and supervised learning approach to identifying comparative sentences from text documents. experiment results using three types of documents, news articles, consumer reviews of products, and internet forum postings, show a precision of 79% and recall of 81%. more detailed results are given in the paper surrounding text:they use heuristic rules and supervised learning techniques to find product features and opinions. recently, jindal and liu [***]<2> also studied the opinions in comparative sentences using the support vector machine and the na. ve bayes classifiers with both manual and automatic rules influence:3 type:2 pair index:947 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. 
our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:703 citee title:major topic detection and its application to opinion summarization citee abstract:watching specific information sources and summarizing the newly discovered opinions is important for governments to improve their services and companies to improve their products . because no queries are posed beforehand, detecting opinions is similar to the task of topic detection on sentence level. besides telling which opinions are positive or negative, identifying which events correlated with such opinions are also important. this paper proposes a major topic detection mechanism to capture main concepts embedded implicitly in a relevant document set. opinion summarization further retrieves all the relevant sentences related to the major topic from the document set, determines the opinion polarity of each relevant sentence, and finally summarizes positive and negative sentences, respectively. surrounding text:ku et al. [***]<2> summarize blog documents by their major topics and the related opinions. 3 influence:2 type:2 pair index:948 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. 
second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of review. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperforms existing methods significantly. surrounding text:a document telling a driver's feelings about his audi a4 is relevant in opinion retrieval; however, a document only describing mechanical features of an audi a4 is not relevant. although the machine learning and natural language processing communities have worked on finding opinions about targets such as products [***]<3>, movie reviews [18]<3> in documents, they usually assume that the documents already contain the relevant opinions. they do not address the general problem of how to retrieve the opinionative documents from a collection influence:2 type:2 pair index:949 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:218 citee title:an effective approach to document retrieval via utilizing wordnet and recognizing phrases citee abstract:noun phrases in queries are identified and classified into four types: proper names, dictionary phrases, simple phrases and complex phrases. a document has a phrase if all content words in the phrase are within a window of a certain size. the window sizes for different types of phrases are different and are determined using a decision tree. phrases are more important than individual terms. consequently, documents in response to a query are ranked with matching phrases given a higher priority. we utilize wordnet to disambiguate word senses of query terms. whenever the sense of a query term is determined, its synonyms, hyponyms, words from its definition and its compound words are considered for possible additions to the query. experimental results show that our approach yields between 23% and 31% improvements over the best-known results on the trec 9, 10 and 12 collections for short (title only) queries, without using web data surrounding text:in the experiment part, we show that the score of our system is about 13% higher than that of this system. zhang and yu [31]<2> adopt concept-based ir [***]<2>, machine-learning-based opinion detection and two separate opinion similarity functions in their system.
their system, which was an earlier version of our system, obtained the best performance among the groups using the title-only queries [17]<2> influence:3 type:3 pair index:950 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:788 citee title:overview of the trec-2006 blog track citee abstract:the rise on the internet of blogging, the creation of journal-like web page logs, has created a highly dynamic subset of the world wide web that evolves and responds to real-world events. indeed, blogs (or weblogs) have recently emerged as a new grassroots publishing medium. the so-called blogosphere (the collection of blogs on the internet) opens up several new interesting research areas. blogs have many interesting features: entries are added in chronological order, sometimes at a high volume. in addition, many blogs are created by their authors, not intended for any sizable audience, but purely as a mechanism for self-expression. extremely accessible blog software has facilitated the act of blogging to a wide-ranging audience, their blogs reflecting their opinions, philosophies and emotions. traditional media tends to focus on heavy-hitting blogs devoted to politics, punditry and technology. however, there are many different genres of blogs, some written around a specific topic, some covering several, and others talking about personal daily life . the blog track began this year, with the aim to explore the information seeking behaviour in the blogosphere. for this purpose, a new large-scale test collection, namely the trec blog06 collection, has been created. in the first pilot run of the track in 2006, we had two tasks, a main task (opinion retrieval) and an open task. the opinion retrieval task focuses on a specific aspect of blogs: the opinionated nature of many blogs. the second task was introduced to allow participants the opportunity to influence the determination of a suitable second task (for 2007) on other aspects of blogs, such as the temporal/event-related nature of many blogs, or the severity of spam in the blogosphere. surrounding text:but these blog search tools are just the applications of the traditional web search techniques in the blog domain, in that, given a query, they only search for fact-related information in the blogs, however, the opinions about various topics, which are very important feature of the blog documents, are not searched by these tools. 
in 2006, the text retrieval conference (trec) introduced a blog track for the first time, focusing on information retrieval in blog documents [***]<3>. the major task of this track was opinion retrieval influence:3 type:3 pair index:951 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:a document telling a driver's feelings about his audi a4 is relevant in opinion retrieval; however, a document only describing mechanical features of an audi a4 is not relevant. although the machine learning and natural language processing communities have worked on finding opinions about targets such as products [15]<3>, movie reviews [***]<3> in documents, they usually assume that the documents already contain the relevant opinions. they do not address the general problem of how to retrieve the opinionative documents from a collection. during the text processing, we only do porter stemming but do not remove any stop words. after the features are selected, all the subjective and objective sentences are represented as feature-presence vectors, where the presence or absence of each feature is recorded [***]<3>. these vectors are used to train the svm classifier influence:3 type:2 pair index:952 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm.
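the sentence representation just described (binary presence/absence of pre-selected features, fed to an svm subjectivity classifier) can be sketched as follows. this is an illustrative assumption of a scikit-learn setup with toy features and sentences, not the authors' code:

```python
# A minimal sketch: sentences become binary feature-presence vectors over a
# pre-selected feature set and are used to train an SVM subjectivity classifier.
# The feature list and sentences below are hypothetical toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

selected_features = ["great", "terrible", "think", "opinion", "battery", "hours"]
sentences = ["i think the camera is great",           # subjective
             "the battery lasts about ten hours",     # objective
             "in my opinion the screen is terrible",  # subjective
             "the package includes a battery charger"]
labels = [1, 0, 1, 0]  # 1 = subjective, 0 = objective

# binary=True records only the presence or absence of each selected feature
vectorizer = CountVectorizer(vocabulary=selected_features, binary=True)
X = vectorizer.transform(sentences)

classifier = LinearSVC().fit(X, labels)
print(classifier.predict(vectorizer.transform(["i think the price is terrible"])))
```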
it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:117 citee title:seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales citee abstract:we address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). this task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". we first evaluate human performance at the task. then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. we show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of svms when we employ a novel similarity measure appropriate to the problem. surrounding text:832 [18]<2> examines several supervised machine-learning methods for positive/negative sentiment classification of movie reviews. pang and lee [***]<2> also studied the problems of multi-category classification of movie reviews by using supervised learning. various types of machine-learning techniques have been applied to the opinion classification task influence:3 type:2 pair index:953 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:119 citee title:extracting product features and opinions from reviews citee abstract:consumers are often forced to wade through many on-line reviews in order to make an informed product choice. 
this paper introduces opine, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products.compared to previous work, opine achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. opine's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity. surrounding text:whitelaw, garg, and argamon [25]<2> used short phrases called appraisal groups subjective language features for sentiment classification. in the opinion identification works, liu, hu and chen [15]<2> and popescu and etzioni [***]<2> both extracted product features and their opinions from the web. they use heuristic rules and supervised learning techniques to find product features and opinions influence:2 type:2 pair index:954 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:790 citee title:using appraisal groups for sentiment analysis citee abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others surrounding text:esuli and sebastiani [4]<2> classified the orientation of a term based on its dictionary glosses. whitelaw, garg, and argamon [***]<2> used short phrases called appraisal groups subjective language features for sentiment classification. 
in the opinion identification works, liu, hu and chen [15]<2> and popescu and etzioni [20]<2> both extracted product features and their opinions from the web influence:3 type:2 pair index:955 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:695 citee title:learning subjective language citee abstract:subjectivity in natural language refers to aspects of language used to express opinions, evaluations, and speculations. there are numerous natural language processing applications for which subjectivity analysis is relevant, including information extraction and text categorization. the goal of this work is learning subjective language from corpora. clues of subjectivity are generated and tested, including low-frequency words, collocations, and adjectives and verbs identified using distributional similarity. the features are also examined working together in concert. the features, generated from different data sets using different procedures, exhibit consistency in performance in that they all do better and worse on the same data sets. in addition, this article shows that the density of subjectivity clues in the surrounding context strongly affects how likely it is that a word is subjective, and it provides the results of an annotation study assessing the subjectivity of sentences with high-density features. finally, the clues are used to perform opinion piece recognition (a type of text categorization and genre detection) to demonstrate the utility of the knowledge acquired in this article. surrounding text:wiebe et al. [***]<2> identified subjective language features, such as low-frequency words, word collocations, adjectives and verbs, from corpora and used them in the sentiment classification. esuli and sebastiani [4]<2> classified the orientation of a term based on its dictionary glosses influence:3 type:2 pair index:956 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. 
it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:12 citee title:a comparative study on feature selection in text categorization citee abstract:this paper is a comparative study of feature selection methods in statistical learning of text categorization. the focus is on aggressive dimensionality reduction. five methods were evaluated, including term selection based on document frequency (df), information gain (ig), mutual information (mi), a χ²-test (chi), and term strength (ts). we found ig and chi most effective in our experiments. using ig thresholding with a k-nearest neighbor classifier on the reuters corpus, removal of up to surrounding text:only a subset of them is chosen via pearson's chi-square test [1]<1>. yang [***]<3> reported that the chi-square test was an effective feature selection approach. to find out how dependent a feature f is with respect to the subjective set or the objective set, we set up a null hypothesis that f is independent of the two categories with respect to its occurrences in the two sets influence:3 type:3 pair index:957 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:761 citee title:multiple ranking strategies for opinion retrieval in blogs citee abstract:we describe our participation in the opinion retrieval task at trec 2006. our approach to identifying opinions in blog posts consisted of scoring the posts separately on various aspects associated with an expression of opinion about a topic, including shallow sentiment analysis, spam detection, and link-based authority estimation. the separate approaches were combined into a single ranking, yielding significant improvement over a content-only baseline. surrounding text:the groups with the top performances were [***][31][32][33]<2>.
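the feature selection step described in the surrounding text above can be made concrete with a small sketch. the 2x2 layout and the 3.841 cut-off (the 0.05 critical value at one degree of freedom) are illustrative assumptions; the source only states that pearson's chi-square test is applied to each candidate feature:

```python
# Score a candidate feature f by Pearson's chi-square test of independence between
# f's presence and the subjective/objective category of a sentence. The toy counts
# and the 3.841 cut-off (0.05 level, 1 degree of freedom) are illustrative assumptions.
def chi_square(subj_with_f, subj_without_f, obj_with_f, obj_without_f):
    observed = [[subj_with_f, obj_with_f],        # sentences containing f
                [subj_without_f, obj_without_f]]  # sentences not containing f
    total = subj_with_f + subj_without_f + obj_with_f + obj_without_f
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

# (subjective with f, subjective without f, objective with f, objective without f)
feature_counts = {"amazing": (120, 880, 10, 990), "tuesday": (50, 950, 48, 952)}
selected = [f for f, c in feature_counts.items() if chi_square(*c) > 3.841]
print(selected)  # independence is rejected only for "amazing"
```

features for which the null hypothesis of independence cannot be rejected are dropped, keeping only terms whose occurrence genuinely differs between the subjective and objective sets.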
mishne [***]<2> adopts fact-oriented ir, dictionary-based opinion expression detection and spam filtering as three major components in his system. this system not only utilizes the blog documents being tested, but also the rss seeds as additional training data influence:1 type:2 pair index:958 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:791 citee title:using blog properties to improve retrieval citee abstract:this paper describes three simple heuristics which improve opinion retrieval effectiveness by using blog-specific properties. blog timestamps are used to increase the retrieval scores of blog posts published near the time of a significant event related to a query; an inexpensive approach to comment amount estimation is used to identify the level of opinion expressed in a post; and query-specific weights are used to change the importance of spam filtering for different types of queries. overall, these methods, combined with non-blogspecific retrieval approaches, result in substantial improvements over state-of-the-art. surrounding text:this system not only utilizes the blog documents being tested, but also the rss seeds as additional training data. mishne [***]<2> recently reported an opinion retrieval system that utilizes the publishing dates of the documents as a feature in the retrieval. this system also compares the contents of actual blog posts and the rss documents to calculate the proportion of the comments in a blog document. 6. 5 comparing to related works in table 7, we compare the best maps from our system to the best maps from various run categories in trec 2006 blog track, and that from a state-of-art opinion retrieval system by mishne [***]<2> in 2007. all these systems are tested on the same trec blog data and query set. in practice, the title-only run is more realistic because users rarely provide the descriptions and narratives when they submit queries. as introduced in the related works, mishne [***]<2> recently reported a state-of-art opinion retrieval system in 2007 that had a strong performance. this system achieved a map of 0 influence:1 type:2 pair index:959 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. 
a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:792 citee title:widit in trec 2006 blog track citee abstract:web information discovery integrated tool (widit) laboratory at the indiana university school of library and information science participated in the blog tracks opinion task in trec2006. the goal of opinion task is to "uncover the public sentiment towards a given entity/target", which involves not only retrieving topically relevant blogs but also identifying those that contain opinions about the target. to further complicate the matter, the blog test collection contains considerable amount of noise, such as blogs with non-english content and non-blog content (e.g., advertisement, navigational text), which may misdirect retrieval systems. based on our hypothesis that noise reduction (e.g., exclusion of non-english blogs, navigational text) will improve both on-topic and opinion retrieval performances, we explored various noise reduction approaches that can effectively eliminate the noise in blog data without inadvertently excluding valid content. after creating two separate indexes (with and without noise) to assess the noise reduction effect, we tackled the opinion blog retrieval task by breaking it down to two sequential subtasks: on-topic retrieval followed by opinion classification. our opinion retrieval approach was to first apply traditional ir methods to retrieve on-topic blogs, and then boost the ranks of opinionated blogs based on opinion scores generated by opinion assessment methods. our opinion module consists of opinion term module, which identify opinions based on the frequency of opinion terms (i.e., terms that only occur frequently in opinion blogs), rare term module, which uses uncommon/rare terms (e.g., sooo good) for opinion classification, iu module, which uses iu (i and you) collocations, and adjective-verb module, which uses computational linguistics distribution similarity approach to learn the subjective language from training data. surrounding text:yang et al. [***]<2> adopt ir components that utilizes proximity match and phrase match. their opinion component adopts frequency-based heuristics, special pronoun patterns and adjective/adverb-based heuristics influence:2 type:2 pair index:960 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. 
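the final component, ranking by both topic relevance and degree of opinion, is described only at a high level in the texts above. the sketch below uses a simple linear interpolation of the two scores; the interpolation and the 0.7/0.3 weighting are assumptions for illustration, not a formula reported by any of these systems:

```python
# A hedged sketch of the final ranking step: interpolate each retrieved document's
# topic-relevance score with its opinion score and sort. The weighting is an
# illustrative assumption, not a value taken from the systems discussed here.
def rerank(docs, alpha=0.7):
    """docs: iterable of (doc_id, relevance_score, opinion_score), scores in [0, 1]."""
    combined = [(doc_id, alpha * relevance + (1 - alpha) * opinion)
                for doc_id, relevance, opinion in docs]
    return sorted(combined, key=lambda item: item[1], reverse=True)

candidates = [("blog-101", 0.82, 0.10),   # on topic but mostly factual
              ("blog-207", 0.64, 0.95),   # on topic and strongly opinionated
              ("blog-315", 0.30, 0.80)]   # opinionated but weakly relevant
print(rerank(candidates))  # the opinionated, on-topic post moves to the top
```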
in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:793 citee title:trec-2006 at maryland: blog, enterprise, legal and qa tracks citee abstract:in trec 2006, teams from the university of maryland participated in the blog track, the expert search task of the enterprise track, the complex interactive question answering task of the question answering track, and the legal track. this paper reports our results. surrounding text:14 groups participated in this task. the groups with the top performances were [29][31][32][***]<2>. mishne [29]<2> adopts fact-oriented ir, dictionary-based opinion expression detection and spam filtering as three major components in his system. oard et al. [***]<2> adopt passage retrieval. both dictionary-based and machine-learning based sentiment term selection methods influence:3 type:2,3 pair index:961 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:794 citee title:recognition and classification of noun phrases in queries for effective retrieval citee abstract:it has been shown that using phrases properly in the document retrieval leads to higher retrieval effectiveness. in this paper, we define four types of noun phrases and present an algorithm for recognizing these phrases in queries. the strengths of several existing tools are combined for phrase recognition. our algorithm is tested using a set of 500 web queries from a query log, and a set of 238 trec queries. experimental results show that our algorithm yields high phrase recognition accuracy. we also use a baseline noun phrase recognition algorithm to recognize phrases from the trec queries. 
a document retrieval experiment is conducted using the trec queries (1) without any phrases, (2) with the phrases recognized from a baseline noun phrase recognition algorithm, and (3) with the phrases recognized from our algorithm respectively. the retrieval effectiveness of (3) is better than that of (2), which is better than that of (1). this demonstrates that utilizing phrases in queries does improve the retrieval effectiveness, and better noun phrase recognition yields higher retrieval performance surrounding text:3. 1 concept identification a concept in a query is defined as either a multi-word phrase consisting of adjacent query words, or a single word that does not belong to any other concept [***]<3>. concept identification is a query pre-processing procedure, which partitions an original query to these concepts influence:3 type:3 pair index:962 citer id:787 citer title:opinion retrieval from blogs citer abstract:opinion retrieval is a document retrieval process, which requires documents to be retrieved and ranked according to their opinions about a query topic. a relevant document must satisfy two criteria: relevant to the query topic, and contains opinions about the query, no matter if they are positive or negative. in this paper, we describe an opinion retrieval algorithm. it has a traditional information retrieval (ir) component to find topic relevant documents from a document set, an opinion classification component to find documents having opinions from the results of the ir step, and a component to rank the documents based on their relevance to the query, and their degrees of having opinions about the query. we implemented the algorithm as a working system and tested it using trec 2006 blog track data in automatic title-only runs. our result showed 28% to 32% improvements in map score over the best automatic runs in this 2006 track. our result is also 13% higher than a state-of-art opinion retrieval system, which is tested on the same data set citee id:795 citee title:query expansion using local and global document analysis citee abstract:automatic query expansion has long been suggested as a technique for dealing with the fundamental issue of word mismatch in information retrieval. a number of approaches to ezpanrnion have been studied and, more recently, attention has focused on techniques that analyze the corpus to discover word relationship (global techniques) and those that analyze documents retrieved by the initial quer~ ( local feedback). in this paper, we compare the effectiveness of these approaches and show that, although global analysis haa some advantages, local analysia is generally more effective. we also show that using global analysis techniques, such as word contezt and phrase structure, on the local aet of documents produces results that are both more effective and more predictable than simple local feedback. surrounding text:the top k terms are returned as potential expanded terms. we adopt the local context analysis [***]<1> as the second query expansion method. our method combines the global analysis and local feedback. the query in the form of concepts is used to retrieve a ranked document list from the given document collection. the terms in the top n (a threshold) documents are ranked according to the formula in [***]<1>. 
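the local-feedback expansion step just described can be sketched roughly as below; the top k terms it returns are the expansion candidates mentioned next. the scoring here is a plain per-document co-occurrence count and deliberately omits the idf-style weighting of the local context analysis formula in the cited work; the function and the toy documents are illustrative assumptions:

```python
# A simplified sketch of ranking candidate expansion terms from the top-n
# pseudo-relevant documents by co-occurrence with the query concepts. This is a
# deliberately reduced stand-in for the local context analysis formula, not the
# formula itself.
from collections import Counter

def expansion_candidates(top_docs, query_concepts, k=10):
    """top_docs: list of token lists for the n top-ranked documents."""
    concepts = set(query_concepts)
    scores = Counter()
    for tokens in top_docs:
        if concepts & set(tokens):               # the document mentions the query
            for term in set(tokens) - concepts:  # credit every other term once per doc
                scores[term] += 1
    return [term for term, _ in scores.most_common(k)]

top_docs = [["audi", "a4", "engine", "handling", "love"],
            ["audi", "a4", "price", "handling"]]
print(expansion_candidates(top_docs, ["audi", "a4"], k=3))
```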
the top k terms are returned as the potential expanded terms of this method influence:3 type:3 pair index:963 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:346 citee title:conjunction and modal assessment in genre classification citee abstract:we use textual features motivated by systemic functional linguistic theory for genre-based text categorization. we have developed feature sets representing different types of conjunctions and modal assessment, which together indicate (partially) how different genres structure texts and express attitudes towards propositions in the text. using such features enables analysis of large-scale rhetorical differences between genres by examining which features are important for classification. the specific domain studied comprises scientific articles in historical and experimental sciences (paleontology and physical chemistry respectively). the smo learning algorithm with our feature set achieved over 83% accuracy for classifying articles according to field, though no field-specific terms were used as features. the most highly-weighted features were consistent with hypothesized methodological differences between historical and experimental sciences, thus lending empirical evidence to the notion of multiple scientific methods. surrounding text:this standard testbed consists of 1000 positive and 1000 negative reviews, taken from the imdb movie review archives6. reviews with neutral scores (such as three stars out of five) were removed by pang and lee, 5in preliminary experiments, the use of relative frequencies within each node of the taxonomy, as in [***, 20]<2>, gave inferior results to this simpler procedure. 6see http://www influence:3 type:3 pair index:964 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. 
in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:175 citee title:a simple rule-based part of speech tagger citee abstract:automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. in this paper, we present a simple rule-based part of speech tagger which automatically acquires its rules and tags with accuracy comparable to stochastic taggers. the rule-based tagger has many advantages over these taggers, including: a vast reduction in stored information required, the perspicuity of a small set of meaningful rules surrounding text:each document is preprocessed into individual sentences and decapitalized. we used an implementation of brills [***]<1> part-of-speech tagger to help us find adjectives and modifiers. classification learning was done using wekas [23]<1> implementation of the smo [16]<1> learning algorithm, using a linear kernel and the default parameters influence:3 type:3 pair index:965 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:228 citee title:an introduction to support vector machines citee abstract:this is the first comprehensive introduction to support vector machines (svms), a new generation learning system based on recent advances in statistical learning theory. svms deliver state-of-the-art performance in real-world applications such as text categorisation, hand-written character recognition, image classification, biosequences analysis, etc., and are now established as one of the standard tools for machine learning and data mining. students will find the book both stimulating and accessible, while practitioners will be guided smoothly through the material required for a good grasp of the theory and its applications. the concepts are introduced gradually in accessible and self-contained stages, while the presentation is rigorous and thorough. pointers to relevant literature and web sites containing software ensure that it forms an ideal starting point for further study. equally, the book and its associated web site will guide practitioners to updated literature, new applications, and on-line software surrounding text:[figure 1: main attributes of appraisal and their highest-level options: attitude (affect, appreciation, judgement), graduation (force, focus), orientation (positive, negative), polarity (marked, unmarked)] documents were
then represented as vectors of relative frequency features computed over these groups and a support vector machine learning algorithm [***]<1> was used to learn a classifier discriminating positively from negatively oriented test documents. we have applied this approach to movie review classification with positive results influence:3 type:3 pair index:966 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:510 citee title:ensemble methods for automatic thesaurus extraction citee abstract:ensemble methods are state of the art for many nlp tasks. recent work by banko and brill (2001) suggests that this would not necessarily be true if very large training corpora were available. however, their results are limited by the simplicity of their evaluation task and individual classifiers.our work explores ensemble efficacy for the more complex task of automatic thesaurus extraction on up to 300 million words. we examine our conflicting results in terms of the constraints on, and complexity of, different contextual representations, which contribute to the sparseness-and noise-induced bias behaviour of nlp systems on very large corpora. surrounding text:requiring a high level manual intervention is not ideal. due to the functional nature of the groupings, system networks would appear ripe for inference using current statistical thesaurus-building techniques [***]<1>. this would also allow the construction of domain-specific ontologies, as opposed to the generic lexicon used in this paper influence:3 type:3 pair index:967 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. 
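as a minimal sketch of the document representation and classifier described in the surrounding text above, the snippet below maps each document to relative frequencies of appraisal attribute values through a tiny hand-made lexicon and trains a linear svm on those vectors. the lexicon entries, the attribute list, and the training snippets are invented for illustration; the cited work relies on a much larger semi-automatically built lexicon and weka's smo implementation.

```python
from collections import Counter
from sklearn.svm import LinearSVC

# hypothetical adjective lexicon: word -> (attitude type, orientation)
LEXICON = {
    "good": ("appreciation", "positive"), "boring": ("appreciation", "negative"),
    "happy": ("affect", "positive"),      "awful": ("appreciation", "negative"),
    "honest": ("judgement", "positive"),  "funny": ("appreciation", "positive"),
}
ATTRS = ["affect", "appreciation", "judgement", "positive", "negative"]

def features(text):
    """Relative frequency of each appraisal attribute value among matched adjectives."""
    words = [w.strip(".,!") for w in text.lower().split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    counts = Counter(value for pair in hits for value in pair)
    total = max(len(hits), 1)
    return [counts[a] / total for a in ATTRS]

train_texts = ["a good and funny film, i left happy",
               "boring plot and awful acting",
               "an honest, good story",
               "awful, boring, awful"]
train_labels = [1, 0, 1, 0]   # 1 = positively oriented review, 0 = negatively oriented
clf = LinearSVC().fit([features(t) for t in train_texts], train_labels)
print(clf.predict([features("a funny and honest film")]))
```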
in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:previous work on including polarity (good vs. not good) has given inconsistent results, either a slight improvement [15]<2> or a decrease [***]<2> from bag-of-words baselines. our results show it to help slightly influence:2 type:2 pair index:968 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for text processing applications such as information extraction and information retrieval. we study the effects of dynamic adjectives, semantically oriented adjectives, and gradable adjectives on a simple subjectivity classifier, and establish that they are strong predictors of subjectivity. a novel trainable method that statistically combines two indicators of gradability is presented and evaluated, complementing existing automatic techniques for assigning orientation labels. surrounding text:subjective use of verbs, adjectives and multi-word expressions can be learnt automatically and used to detect sentence-level subjectivity [21]<2>. adjectives play a strong role in subjective language, especially the class of gradable adjectives [***]<2> that can take modifiers such as very.
more generally, wilson et al influence:3 type:2 pair index:969 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:859 citee title:the rhetorical parsing of unrestricted texts: a surface-based approach citee abstract:coherent texts are not just simple sequences of clauses and sentences, but rather complex artifacts that have highly elaborate rhetorical structure. this paper explores the extent to which well-formed rhetorical structures can be automatically derived by means of surface-form-based algorithms. these algorithms identify discourse usages of cue phrases and break sentences into clauses, hypothesize rhetorical relations that hold among textual units, and produce valid rhetorical structure trees for unrestricted natural language texts. the algorithms are empirically grounded in a corpus analysis of cue phrases and rely on a first-order formalization of rhetorical structure trees. the algorithms are evaluated both intrinsically and extrinsically. the intrinsic evaluation assesses the resemblance between automatically and manually constructed rhetorical structure trees. the extrinsic evaluation shows that automatically derived rhetorical structures can be successfully exploited in the context of text summarization surrounding text:this is equivalent, in our approach, to using just orientation values, computing a weighted sum. [footnote 8: curiously, and, also, and as are strong features for positive sentiment. this may indicate that rhetorical structure [***, 1]<3> is also important for understanding sentiment.] [table: comparison of feature sets base, s:a, s:ao, g:a, g:ao, g:aof, bow, bow + g:ao, bow + g:aof, w:a] influence:3 type:3 pair index:970 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%.
in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:845 citee title:sentiment analysis using support vector machines with diverse information sources citee abstract:this paper introduces an approach to sentiment analysis which uses support vector machines (svms) to bring together diverse sources of potentially pertinent information, including several favorability measures for phrases and adjectives and, where available, knowledge of the topic of the text. models using the features introduced are further combined with unigram models which have been shown to be effective in the past (pang et al., 2002) and lemmatized versions of the unigram models. experiments on movie review data from the internet movie database demonstrate that hybrid svms which combine unigram-style feature-based svms with those based on real-valued favorability measures obtain superior performance, producing the best results yet published using this data. further experiments using a feature set enriched with topic information on a smaller dataset of music reviews hand-annotated for topic are also reported, the results of which suggest that incorporating topic information into such models may also yield improvement. surrounding text:for comparison, p&l-04 denotes the best results obtained on this data set by pang and lee [14]<2>. m&c-03 denotes results obtained on the earlier movie review dataset by mullen and collier [***]<2> for two feature sets. giving a data set with only clearly positive and negative reviews7. directly comparable is the highest previous accuracy for this dataset, attained by pang and lee [14]<2> via a complex combination of subjectivity clustering and bag-of-words classification for sentiment analysis. we also show two results from mullen and collier [***]<2>, the only previous work we are aware of to use something akin to attitude type for sentiment analysis. unfortunately, their results are not directly comparable with ours, as they used 7this lack of inconclusive documents may limit the realworld applicability of results on this dataset. early attempts at classifying movie reviews used standard bag-of-words techniques with limited success [15]<2>. the addition of typed features of semantic orientation has been shown to improve results [***]<2>. semantic orientation has also been useful for classifying more general product reviews [19]<2>. there have been some previous attempts at using a more structured linguistic analysis of text for sentiment classification, with mixed results. mullen and collier [***]<2> produced features based on osgoods theory of semantic differentiation, using wordnet to judge the potency, activity, and evaluative factors for adjectives. using these features did not yield any reliable benefit, although it is unclear whether this is due to the theory or to its implementation influence:1 type:2 pair index:971 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. 
semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:115 citee title:sentiment analysis: capturing favorability using natural language processing citee abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative.the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles. surrounding text:the same study used a small corpus of music reviews manually annotated for artist and work, and showed that knowing the appraised can increase performance. nasukawa and yi [***]<2> use pos-tagging, chunking, and manual templates to investigate sentiment verbs and sentiment transfer verbs and have shown that this approach can aid high-precision analysis. previous work on including polarity (good vs influence:1 type:2 pair index:972 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:174 citee title:a sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts citee abstract:sentiment analysis seeks to identify the viewpoint(s) underlying a text span; an example application is classifying a movie review as "thumbs up" or "thumbs down". to determine this sentiment polarity, we propose a novel machine-learning method that applies text-categorization techniques to just the subjective portions of the document. extracting these portions can be implemented using efficient techniques for finding minimum cuts in graphs; this greatly facilitates incorporation of cross-sentence contextual constraints surrounding text:experiments 4. 
1 methodology to test the usefulness of adjectival appraisal groups for sentiment analysis, we evaluated the effectiveness of the above feature sets for movie review classification, using the publicly available collection of movie reviews constructed by pang and lee [***]<3>. this standard testbed consists of 1000 positive and 1000 negative reviews, taken from the imdb movie review archives6 influence:1 type:2 pair index:973 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:for the whole document. early attempts at classifying movie reviews used standard bag-of-words techniques with limited success [***]<2>. the addition of typed features of semantic orientation has been shown to improve results [12]<2>. previous work on including polarity (good vs. not good) have given inconsistent resultseither a slight improvement [***]<2> or decrease [5]<2> from bag-of-word baselines. our results show it to help slightly influence:1 type:2 pair index:974 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. 
in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:561 citee title:fast training of support vector machines using sequential minimal optimization citee abstract:this chapter describes a new algorithm for training support vector machines: sequential minimal optimization, or smo. training a support vector machine (svm) requires the solution of a very large quadratic programming (qp) optimization problem. smo breaks this large qp problem into a series of smallest possible qp problems. these small qp problems are solved analytically, which avoids using a time-consuming numerical qp optimization as an inner loop. the amount of memory required for smo is linear in the training set size, which allows smo to handle very large training sets. because large matrix computation is avoided, smo scales somewhere between linear and quadratic in the training set size for various test problems, while a standard projected conjugate gradient (pcg) chunking algorithm scales somewhere between linear and cubic in the training set size. smo's computation time is dominated by svm evaluation, hence smo is fastest for linear svms and sparse data sets. for the mnist database, smo is as fast as pcg chunking; while for the uci adult database and linear svms, smo can be more than 1000 times faster than the pcg chunking algorithm. surrounding text:we used an implementation of brills [2]<1> part-of-speech tagger to help us find adjectives and modifiers. classification learning was done using wekas [23]<1> implementation of the smo [***]<1> learning algorithm, using a linear kernel and the default parameters. it is possible that better results might be obtained by using a different kernel and tuning the parameter values; however, such experiments would have little validity, due to the small size of the current testbed corpus influence:3 type:3,1 pair index:975 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:234 citee title:analyzing appraisal automatically citee abstract:we present a method for classifying texts automatically, based on their subjective content. we apply a standard method for calculating semantic orientation (turney 2002), and expand it by giving more prominence to certain parts of the text, where we believe most subjective content is concentrated. we also apply a linguistic classification of appraisal and find that it could be helpful in distinguishing different types of subjective texts (e.g., movie reviews from consumer product reviews).
surrounding text:the other main approach (semantic orientation) classifies words (usually automatically) into two classes, good and bad, and then computes an overall good/bad score for the text, however, such approaches miss important aspects of the task. first, a more detailed semantic analysis of attitude expressions is needed, in the form of a well-designed taxonomy of attitude types and other semantic properties (as noted by taboada and grieve [***]<2>). second, the atomic units of such expressions are not individual words, but rather appraisal groups: coherent groups of words that express together a particular attitude, such as extremely boring, or not really very good. from this perspective, most previous sentiment classification research has focused exclusively upon orientation, with attitude type addressed only indirectly, through the use of bag-of-words features. an exception is taboada and grieves [***]<2> method of automatically determining top-level attitude types via application of of turneys pmi method [19]<2>. they observed that different types of reviews contain different amounts of each attitude-type influence:3 type:2 pair index:976 citer id:790 citer title:using appraisal groups for sentiment analysis citer abstract:little work to date in sentiment analysis (classifying texts by positive or negative orientation) has attempted to use fine-grained semantic distinctions in features used for classification. we present a new method for sentiment classification based on extracting and analyzing appraisal groups such as very good or not terribly funny. an appraisal group is represented as a set of attribute values in several task-independent semantic taxonomies, based on appraisal theory. semi-automated methods were used to build a lexicon of appraising adjectives and their modifiers. we classify movie reviews using features based upon these taxonomies combined with standard bag-of-words features, and report state-of-the-art accuracy of 90.2%. in addition, we find that some types of appraisal appear to be more significant for sentiment classification than others citee id:637 citee title:identifying interpersonal distance using systemic features citee abstract:this chapter uses systemic functional linguistic (sfl) theory as a basis for extracting semantic features of documents. we focus on the pronominal and determination system and the role it plays in constructing interpersonal distance. by using a hierarchical system model that represents the authors language choices, it is possible to construct a richer and more informative feature representation with superior computational efficiency than the usual bag-of-words approach. experiments within the context of financial scam classification show that these systemic features can create clear separation between registers with different interpersonal distance. this approach is generalizable to other aspects of attitude and affect that have been modelled within the systemic functional linguistic theory. surrounding text:this standard testbed consists of 1000 positive and 1000 negative reviews, taken from the imdb movie review archives6. reviews with neutral scores (such as three stars out of five) were removed by pang and lee, 5in preliminary experiments, the use of relative frequencies within each node of the taxonomy, as in [1, ***]<2>, gave inferior results to this simpler procedure. 
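as an illustrative sketch of the turney-style pmi method for semantic orientation referred to in the surrounding text above, the snippet below scores a phrase by how much more strongly it co-occurs with a positive seed word (excellent) than with a negative one (poor). the co-occurrence counts are invented stand-ins for the search-engine hit counts used in the original method.

```python
import math

# hypothetical co-occurrence counts; in the original method these are web search
# hit counts for "phrase NEAR excellent" and "phrase NEAR poor"
near_excellent = {"low fees": 320, "unpredictable plot": 90, "direct deposit": 500}
near_poor = {"low fees": 40, "unpredictable plot": 210, "direct deposit": 60}
hits_excellent, hits_poor = 1_500_000, 1_200_000   # assumed seed-word totals

def so_pmi(phrase):
    """SO(phrase) = log2(hits(phrase near excellent) * hits(poor) /
                         (hits(phrase near poor) * hits(excellent))), with smoothing."""
    num = (near_excellent.get(phrase, 0) + 0.01) * hits_poor
    den = (near_poor.get(phrase, 0) + 0.01) * hits_excellent
    return math.log2(num / den)

for phrase in near_excellent:
    print(phrase, round(so_pmi(phrase), 2))   # positive score suggests positive orientation
```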
6see http://www influence:3 type:3 pair index:977 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:776 citee title:on the resemblance and containment of documents citee abstract:given two documents a and b we de ne two mathematical notions: their resemblance r(a; b) and their containment c(a; b) that seem to capture well the informal notions of \roughly the same" and \roughly containedfi " the basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done inde-pendently for each documentfi furthermore, the resemblance can be evaluated using a xed size sample for each documentfi this paper discusses the mathematical properties of these measures and the e cient implementation of the sampling process using rabin ngerprintsfi surrounding text:4. 1 detection of duplicate reviews duplicate and near-duplicate (not exact copy) reviews can be detected using the shingle method in [***]<2>. in this work, we use 2gram based review content comparison influence:3 type:3 pair index:978 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. 
based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:related work analysis of on-line opinions became a popular research topic recently. as we mentioned in the previous section, current studies are mainly focused on mining opinions in reviews and/or classify reviews as positive or negative based on the sentiments of the reviewers [***, 12, 16, 20, 21, 23]<2>. this paper focuses on studying opinion spam activities in reviews influence:2 type:2 pair index:979 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:797 citee title:web spam taxonomy citee abstract:web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. recently, the amount of web spam has increased dramatically, leading to a degradation of search results. 
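as an illustrative sketch of the shingle-based near-duplicate detection described in the surrounding text above, the snippet below reduces each review to its set of word 2-grams and compares two reviews by the jaccard resemblance of those sets. the 0.7 threshold is an assumption, and the cited work additionally samples shingles with fingerprints so that the comparison scales to millions of reviews.

```python
def shingles(text, n=2):
    """Set of word n-grams (2-grams by default) of a review."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def resemblance(a, b, n=2):
    """Jaccard resemblance of the shingle sets of two reviews."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

r1 = "this camera takes great pictures and the battery lasts long"
r2 = "this camera takes great pictures and the battery lasts very long"
r3 = "terrible phone, the screen cracked within a week"
print(resemblance(r1, r2))        # high score: likely a near-duplicate pair
print(resemblance(r1, r3))        # low score: unrelated reviews
print(resemblance(r1, r2) > 0.7)  # illustrative duplicate threshold
```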
this paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures. surrounding text:review spam is similar to web page spam. in the context of web search, due to the economic and/or publicity value of the rank position of a page returned by a search engine, web page spam is widespread [3, 5, ***, 24, 25, 26]<3>. web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [***, 19]<3>. in the context of web search, due to the economic and/or publicity value of the rank position of a page returned by a search engine, web page spam is widespread [3, 5, ***, 24, 25, 26]<3>. web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [***, 19]<3>. in the context of reviews, the problem is similar, but also quite different influence:2 type:2 pair index:980 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. 
we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:they are used by potential customers to find opinions of existing users before deciding to purchase a product. they are also used by product manufacturers to identify product problems and/or to find marketing intelligence information about their competitors [***, 16]<3>. in the past few years, there was a growing interest in mining opinions in reviews from both academia and industry influence:2 type:2 pair index:981 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:409 citee title:detecting spam web pages through content analysis citee abstract:in this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. this paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. when combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%). surrounding text:in the context of web search, due to the economic and/or publicity value of the rank position of a page returned by a search engine, web page spam is widespread [3, 5, 10, 24, 25, 26]<3>. web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [10, ***]<3>. 
in the context of reviews, the problem is similar, but also quite different influence:2 type:2 pair index:982 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:related work analysis of on-line opinions became a popular research topic recently. as we mentioned in the previous section, current studies are mainly focused on mining opinions in reviews and/or classify reviews as positive or negative based on the sentiments of the reviewers [7, 12, 16, ***, 21, 23]<2>. this paper focuses on studying opinion spam activities in reviews influence:2 type:2 pair index:983 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. 
to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:119 citee title:extracting product features and opinions from reviews citee abstract:consumers are often forced to wade through many on-line reviews in order to make an informed product choice. this paper introduces opine, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products.compared to previous work, opine achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. opine's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity. surrounding text:related work analysis of on-line opinions became a popular research topic recently. as we mentioned in the previous section, current studies are mainly focused on mining opinions in reviews and/or classify reviews as positive or negative based on the sentiments of the reviewers [7, 12, 16, 20, ***, 23]<2>. this paper focuses on studying opinion spam activities in reviews influence:2 type:2 pair index:984 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:798 citee title:spam double-funnel: connecting web spammers with advertisers citee abstract:spammers use questionable search engine optimization (seo) techniques to promote their spam links into top search results. in this paper, we focus on one prevalent type of spam c redirection spam c where one can identify spam pages by the third-party domains that these pages redirect traffic to. 
we propose a fivelayer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords c one targeting spammers and the other targeting advertisers. the methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages. surrounding text:review spam is similar to web page spam. in the context of web search, due to the economic and/or publicity value of the rank position of a page returned by a search engine, web page spam is widespread [3, 5, 10, ***, 25, 26]<3>. web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [10, 19]<3> influence:2 type:2 pair index:985 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:638 citee title:identifying link farm spam pages citee abstract:with the increasing importance of search in guiding today's web traffic, more and more effort has been spent to create search engine spam. since link analysis is one of the most important factors in current commercial search engines' ranking systems, new kinds of spam aiming at links have appeared. building link farms is one technique that can deteriorate link-based ranking algorithms. in this paper, we present algorithms for detecting these link farms automatically by first generating a see surrounding text:review spam is similar to web page spam. in the context of web search, due to the economic and/or publicity value of the rank position of a page returned by a search engine, web page spam is widespread [3, 5, 10, 24, ***, 26]<3>. web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [10, 19]<3> influence:2 type:2 pair index:986 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. 
recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:799 citee title:topical trustrank: using topicality to combat web spam citee abstract:web spam is behavior that attempts to deceive search engine ranking algorithms. trustrank is a recent algorithm that can combat web spam. however, trustrank is vulnerable in the sense that the seed set used by trustrank may not be sufficiently representative to cover well the different topics on the web. also, for a given seed set, trustrank has a bias towards larger communities. we propose the use of topical information to partition the seed set and calculate trust scores for each topic separately to address the above issues. a combination of these trust scores for a page is used to determine its ranking. experimental results on two large datasets show that our topical trustrank has a better performance than trustrank in demoting spam sites or pages. compared to trustrank, our best technique can decrease spam from the top ranked sites by as much as 43.1% surrounding text:review spam is similar to web page spam. in the context of web search, due to the economic and/or publicity value of the rank position of a page returned by a search engine, web page spam is widespread [3, 5, 10, 24, 25, ***]<3>. web page spam refers to the use of illegitimate means to boost the rank positions of some target pages in search engines [10, 19]<3> influence:2 type:2 pair index:987 citer id:796 citer title:opinion spam and analysis citer abstract:evaluative texts on the web have become a valuable source of opinions on products, services, events, individuals, etc. recently, many researchers have studied such opinion sources as product reviews, forum posts, and blogs. however, existing research has been focused on classification and summarization of opinions using natural language processing and data mining techniques. an important issue that has been neglected so far is opinion spam or trustworthiness of online opinions. in this paper, we study this issue in the context of product reviews, which are opinion rich and are widely used by consumers and product manufacturers. in the past two years, several startup companies also appeared which aggregate opinions from product reviews. it is thus high time to study spam in reviews. 
to the best of our knowledge, there is still no published study on this topic, although web spam and email spam have been investigated extensively. we will see that opinion spam is quite different from web spam and email spam, and thus requires different detection techniques. based on the analysis of 5.8 million reviews and 2.14 million reviewers from amazon.com, we show that opinion spam in reviews is widespread. this paper analyzes such spam activities and presents some novel techniques to detect them citee id:800 citee title:utility scoring of product reviews citee abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets surrounding text:rating is only part of a review and another main part is the review text. [***]<2> studies the utility of reviews based on natural language features. spam is a much broader concept involving all types of objectionable activities influence:3 type:2 pair index:988 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:639 citee title:identifying sources of opinions with conditional random fields and extraction patterns citee abstract:recent systems have been developed for sentiment classification, opinion recognition, and opinion analysis (e.g., detecting polarity and strength). we pursue another aspect of opinion analysis: identifying the sources of opinions, emotions, and sentiments. we view this problem as an information extraction task and adopt a hybrid approach that combines conditional random fields (lafferty et al., 2001) and a variation of autoslog (riloff, 1996a). while crfs model source identification as a sequence tagging task, autoslog learns extraction patterns. our results show that the combination of these two methods performs better than either one alone. the resulting system identifies opinion sources with 79.3% precision and 59.5% recall using a head noun matching measure, and 81.2% precision and 60.6% recall using an overlap measure surrounding text:with this approach, the system was able to automatically identify the contextual polarity for a large subset of sentiment expressions, achieving results that are significantly better than baseline. in [***]<2>, another aspect of opinion analysis was pursued: identifying the sources of opinions, emotions, and sentiments. the problem was viewed as an information extraction task influence:3 type:2 pair index:989 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. 
we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:in turn, the polarity of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. [***]<2> identified a set of features for automatically distinguishing between positive and negative reviews, and empirically compared a number of classifiers on cnet and amazon review data. [9]<2> also considered the problem of classifying documents by overall sentiment, e influence:3 type:2 pair index:990 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for text processing applications such as information extraction and information retrieval. we study the effects of dynamic adjectives, semantically oriented adjectives, and gradable adjectives on a simple subjectivity classifier, and establish that they are strong predictors of subjectivity. a novel trainable method that statistically combines two indicators of gradability is presented and evaluated, complementing existing automatic techniques for assigning orientation labels. surrounding text:[18]<2> tried to learn a collection of subjective adjectives using word clustering according to distributional similarity [5]<3>, seeded by a small amount of detailed manual annotation. with a finer taxonomy of subjective adjectives in mind, [***]<2> learned sets of dynamic adjectives, semantically oriented adjectives, and gradable adjectives from corpora using a simple log-linear model, and established that these adjectives are strong predictors of subjectivity
not only adjectives, but also nouns can be subjective influence:3 type:2 pair index:991 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:270 citee title:automatic retrieval and clustering of similar words citee abstract:bootstrapping semantics from text is one of the greatest challenges in natural language learning. earlier research showed that it is possible to automatically identify words that are semantically similar to a given word based on the syntactic collocation patterns of the words. we present an approach that goes a step further by obtaining a tree structure among the most similar words so that different senses of a given word can be identified with different subtrees. submission type: paper topic surrounding text:the importance of acquiring lexical clues for subjectivity analysis has been recognized. [18]<2> tried to learn a collection of subjective adjectives using word clustering according to distributional similarity [***]<3>, seeded by a small amount of detailed manual annotation. with a finer taxonomy of subjective adjectives in mind, [4]<2> learned sets of dynamic adjectives, semantically oriented adjectives, and gradable adjectives from corpora using a simple log-linear model, and established that these adjectives are strong predictors of subjectivity influence:3 type:3 pair index:992 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. 
experimental results show that the technique is highly effective and outperform existing methods significantly. surrounding text:shifting from classification to extraction, [10]<2> introduced opine, an unsupervised information extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products. in a similar effort, [***]<2> proposed a novel framework for analyzing and comparing consumer opinions of competing products. a prototype system called opinion observer is also implemented influence:3 type:2 pair index:993 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:174 citee title:a sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts citee abstract:sentiment analysis seeks to identify the viewpoint(s) underlying a text span; an example application is classifying a movie review as "thumbs up" or "thumbs down". to determine this sentiment polarity, we propose a novel machine-learning method that applies text-categorization techniques to just the subjective portions of the document. extracting these portions can be implemented using efficient techniques for finding minimum cuts in graphs; this greatly facilitates incorporation of cross-sentence contextual constraints surrounding text:however, the three machine learning methods employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization, due to challenges related to non-compositional semantics and discourse structures. [***]<2> studied the same problem with an additional tweak. subjective portions of text are first extracted using a graph min-cut algorithm, and then fed into text categorization algorithms to approach sentiment polarity classification influence:2 type:2 pair index:994 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:117 citee title:seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales citee abstract:we address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). this task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". we first evaluate human performance at the task. 
then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. we show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of svms when we employ a novel similarity measure appropriate to the problem. surrounding text:subjective portions of text are first extracted using a graph min-cut algorithm, and then fed into text categorization algorithms to approach sentiment polarity classification. pushing further along the same line, [***]<2> addressed the rating-inference problem, wherein rather than simply decide whether a review is thumbs up or thumbs down, one must determine an authors evaluation with respect to a multi-point scale (e. g influence:3 type:2 pair index:995 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:[3]<2> identified a set of features for automatically distinguishing between positive and negative reviews, and empirically compared a number of classifiers on cnet and amazon review data. [***]<2> also considered the problem of classifying documents by overall sentiment, e. g influence:3 type:2 pair index:996 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:119 citee title:extracting product features and opinions from reviews citee abstract:consumers are often forced to wade through many on-line reviews in order to make an informed product choice. this paper introduces opine, an unsupervised information-extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products.compared to previous work, opine achieves 22% higher precision (with only 3% lower recall) on the feature extraction task. opine's novel use of relaxation labeling for finding the semantic orientation of words in context leads to strong performance on the tasks of finding opinion phrases and their polarity. 
surrounding text:a metric labeling approach was compared with both multi-class and regression versions of svms. shifting from classification to extraction, [***]<2> introduced opine, an unsupervised information extraction system which mines reviews in order to build a model of important product features, their evaluation by reviewers, and their relative quality across products. in a similar effort, [6]<2> proposed a novel framework for analyzing and comparing consumer opinions of competing products influence:2 type:2 pair index:997 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:120 citee title:learning extraction patterns for subjective expressions citee abstract:this paper presents a bootstrapping process that learns linguistically rich extraction patterns for subjective (opinionated) expressions. high-precision classifiers label unannotated data to automatically create a large training set, which is then given to an extraction pattern learning algorithm. the learned patterns are then used to identify more subjective sentences. the bootstrapping process learns many subjective patterns and increases recall while maintaining high precision. surrounding text:in [13]<2>, a bootstrapping technique is used to learn subjective nouns from corpora. at the phrase level, [***]<2> presented a bootstrapping process that learns linguistically rich extraction patterns for subjective (opinionated) expressions. high-precision classifiers are used to label unannotated data and automatically create a large training set, which is then given to an extraction pattern learning algorithm influence:3 type:2 pair index:998 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:543 citee title:exploiting subjectivity classification to improve information extraction citee abstract:information extraction (ie) systems are prone to false hits for a variety of reasons and we observed that many of these false hits occur in sentences that contain subjective language (e.g., opinions, emotions, and sentiments). motivated by these observations, we explore the idea of using subjectivity analysis to improve the precision of information extraction systems. in this paper, we describe an ie system that uses a subjective sentence classifier to filter its extractions. we experimented with several different strategies for using the subjectivity classifications, including an aggressive strategy that discards all extractions found in subjective sentences and more complex strategies that selectively discard extractions. we evaluated the performance of these different approaches on the muc-4 terrorism data set. 
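The metric-labeling meta-algorithm discussed in the rating-inference record above (adjusting an n-ary classifier's output so that similar reviews receive similar star ratings) can be approximated by a simple local search. The sketch below is only an illustration of the idea under assumed inputs (per-item label costs from a base classifier, a review-similarity matrix, and |k - l| as the label distance); it is not the cited authors' exact formulation.

import numpy as np

def smooth_ratings(base_cost, sim, alpha=1.0, n_iter=10):
    # base_cost[i, k]: cost of giving review i the rating k (e.g. a negative log-probability
    #                  from the underlying n-ary classifier)
    # sim[i, j]:       similarity between reviews i and j (0 when unrelated)
    # label distance is |k - l|, so "three stars" is closer to "four stars" than to "one star"
    n, k = base_cost.shape
    labels = base_cost.argmin(axis=1)              # start from the base classifier's choices
    for _ in range(n_iter):
        changed = False
        for i in range(n):
            cand = np.array([base_cost[i, c] +
                             alpha * sum(sim[i, j] * abs(c - labels[j])
                                         for j in range(n) if j != i)
                             for c in range(k)])
            best = int(cand.argmin())
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:                            # reached a local optimum of the labeling objective
            break
    return labels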
we found that indiscriminately filtering extractions from subjective sentences was overly aggressive, but more selective filtering strategies improved ie precision with minimal recall loss. surrounding text:a hybrid approach was adopted, which combines conditional random fields and extraction pattern learning. subjectivity analysis has also been shown to be useful for other nlp applications such as information extraction [***]<3> and question answering [15]<3>. 2 influence:3 type:3 pair index:999 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:696 citee title:learning subjective nouns using extraction pattern bootstrapping citee abstract:we explore the idea of creating a subjectivity classifier that uses lists of subjective nouns learned by bootstrapping algorithms. the goal of our research is to develop a system that can distinguish subjective sentences from objective sentences. first, we use two bootstrapping algorithms that exploit extraction patterns to learn sets of subjective nouns. then we train a naive bayes classifier using the subjective nouns, discourse features, and subjectivity clues identified in prior research surrounding text:not only adjectives, but also nouns can be subjective. in [***]<2>, a bootstrapping technique is used to learn subjective nouns from corpora. at the phrase level, [11]<2> presented a bootstrapping process that learns linguistically rich extraction patterns for subjective (opinionated) expressions influence:3 type:2 pair index:1000 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:566 citee title:introduction to modern information retrieval citee abstract:new technology now allows the design of sophisticated information retrieval systems that can not only analyze, process and store, but can also retrieve specific resources matching a particular user's needs. this clear and practical text relates the theory, techniques and tools critical to making information retrieval work. a completely revised second edition incorporates the latest developments in this rapidly expanding field, including multimedia information retrieval, user interfaces and digital libraries. chowdhury's coverage is comprehensive, including classification, cataloging, subject indexing, abstracting, vocabulary control; cd-rom and online information retrieval; multimedia, hypertext and hypermedia; expert systems and natural language processing; user interface systems; internet, world wide web and digital library environments. illustrated with many examples and comprehensively referenced for an international audience, this is an ideal textbook for students of library and information studies and those professionals eager to advance their knowledge of the future of information.
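The bootstrapped subjectivity classifier mentioned in the record above (a naive Bayes model over subjective nouns, discourse features, and other subjectivity clues) can be sketched as follows. The lexicons, features, and training sentences are made-up placeholders rather than the sets produced by the actual bootstrapping algorithms; the sketch only shows how such clues could feed a naive Bayes sentence classifier.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical clue lists; the real ones are learned by the bootstrapping step.
SUBJECTIVE_NOUNS = {"concern", "hope", "outrage", "praise"}
OTHER_CLUES = {"clearly", "fortunately", "unfortunately", "amazing"}

def clue_features(sentence):
    tokens = sentence.lower().split()
    return [
        int(any(t in SUBJECTIVE_NOUNS for t in tokens)),  # contains a learned subjective noun
        int(any(t in OTHER_CLUES for t in tokens)),       # contains another subjectivity clue
        int("!" in sentence),                             # crude discourse/punctuation cue
    ]

train = ["fortunately the praise was well deserved !",
         "the device weighs 300 grams and ships with a charger ."]
y = [1, 0]                                                # 1 = subjective, 0 = objective
clf = BernoulliNB().fit(np.array([clue_features(s) for s in train]), y)
print(clf.predict(np.array([clue_features("there is real outrage about this lens")])))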
surrounding text:with these motivations in mind, we measure the similarity between customer review and product specification, sim(t,s), and that between customer review and editorial review, sim(t,e), respectively. we use the standard cosine similarity in vector space model, with tf*idf term weighting, as defined in information retrieval literature [***]<3>. 5 influence:3 type:3 pair index:1001 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:760 citee title:multi-perspective question answering using the opqa corpus citee abstract:we investigate techniques to support the answering of opinion-based questions. we first present the opqa corpus of opinion questions and answers. using the corpus, we compare and contrast the properties of fact and opinion questions and answers. based on the disparate characteristics of opinion vs. fact answers, we argue that traditional fact-based qa approaches may have difficulty in an mpqa setting without modification. as an initial step towards the development of mpqa systems, we investigate the use of machine learning and rule-based subjectivity and opinion source filters and show that they can be used to guide mpqa systems surrounding text:a hybrid approach was adopted, which combines conditional random fields and extraction pattern learning. subjectivity analysis has also been shown to be useful for other nlp applications such as information extraction [12]<3> and question answering [***]<3>. 2 influence:2 type:3,2 pair index:1002 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:6 citee title:a bootstrapping method for learning semantic lexicons using extraction pattern contexts citee abstract:this paper describes a bootstrapping algorithm called basilisk that learns high-quality semantic lexicons for multiple categories. basilisk begins with an unannotated corpus and seed words for each semantic category, which are then bootstrapped to learn new words for each category. basilisk hypothesizes the semantic class of a word based on collective information over a large body of extraction pattern contexts. we evaluate basilisk on six semantic categories. the semantic lexicons produced by basilisk have higher precision than those produced by previous techniques, with several categories showing substantial improvement surrounding text:3. the list of strong subjective nouns and weak subjective nouns generated by basilisk [***]<1>. 4 influence:3 type:3 pair index:1003 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. 
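The similarity features sim(T,S) and sim(T,E) used above (customer review against product specification and against editorial review) reduce to standard tf*idf cosine similarity. A minimal sketch, with invented example texts and scikit-learn standing in for a hand-rolled vector-space implementation:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

review    = "the lens is sharp and the battery lasts all day"            # customer review T
spec      = "4x zoom lens, rechargeable lithium battery, 5 megapixels"   # product specification S
editorial = "a compact camera with a sharp lens but mediocre battery"    # editorial review E

# Fit one tf*idf vocabulary over all three texts, then compare the review to the other two.
vec = TfidfVectorizer()
m = vec.fit_transform([review, spec, editorial])
sim_t_s = cosine_similarity(m[0], m[1])[0, 0]   # sim(T, S)
sim_t_e = cosine_similarity(m[0], m[2])[0, 0]   # sim(T, E)
print(round(sim_t_s, 3), round(sim_t_e, 3))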
we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:694 citee title:learning subjective adjectives from corpora citee abstract:subjectivity tagging is distinguishing sentences used to present opinions and evaluations from sentences used to objectively present factual information. there are numerous applications for which subjectivity tagging is relevant, including information extraction and information retrieval. this paper identifies strong clues of subjectivity using the results of a method for clustering words according to distributional similarity (lin 1998), seeded by a small amount of detailed manual annotation surrounding text:the importance of acquiring lexical clues for subjectivity analysis has been recognized. [***]<2> tried to learn a collection of subjective adjectives using word clustering according to distributional similarity [5]<3>, seeded by a small amount of detailed manual annotation. with a finer taxonomy of subjective adjectives in mind, [4]<2> learned sets of dynamic adjectives, semantically oriented adjectives, and gradable adjectives from corpora using a simple log-linear model, and established that these adjectives are strong predictors of subjectivity influence:3 type:2 pair index:1004 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:695 citee title:learning subjective language citee abstract:subjectivity in natural language refers to aspects of language used to express opinions, evaluations, and speculations. there are numerous natural language processing applications for which subjectivity analysis is relevant, including information extraction and text categorization. the goal of this work is learning subjective language from corpora. clues of subjectivity are generated and tested, including low-frequency words, collocations, and adjectives and verbs identified using distributional similarity. the features are also examined working together in concert. the features, generated from different data sets using different procedures, exhibit consistency in performance in that they all do better and worse on the same data sets. in addition, this article shows that the density of subjectivity clues in the surrounding context strongly affects how likely it is that a word is subjective, and it provides the results of an annotation study assessing the subjectivity of sentences with high-density features. finally, the clues are used to perform opinion piece recognition (a type of text categorization and genre detection) to demonstrate the utility of the knowledge acquired in this article. surrounding text:1 subjectivity in text subjectivity in natural language refers to aspects of language used to express opinions, evaluations, and speculations. [***]<0> provided a good overview of work in learning subjective language from corpora. 
clues of subjectivity are generated and tested, including low-frequency words, collocations, and adjectives and verbs identified using distributional similarity influence:3 type:2 pair index:1005 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:833 citee title:recognizing contextual polarity in phrase-level sentiment analysis citee abstract:this paper presents a new approach to phrase-level sentiment analysis that first determines whether an expression is neutral or polar and then disambiguates the polarity of the polar expressions. with this approach, the system is able to automatically identify the contextual polarity for a large subset of sentiment expressions, achieving results that are significantly better than baseline. surrounding text:the bootstrapping process learns many subjective patterns and increases recall in subjectivity identification while maintaining high precision. instead of viewing subjectivity as a sentence- or passagelevel property and studying the overall polarity of text, [***]<2> focused on phrase-level sentiment analysis. they first determined whether an expression is neutral or polar and then disambiguated the polarity of the polar expressions influence:3 type:2 pair index:1006 citer id:800 citer title:utility scoring of product reviews citer abstract:we identify a new task in the ongoing research in text sentiment analysis: predicting utility of product reviews, which is orthogonal to polarity classification and opinion extraction. we build regression models by incorporating a diverse set of features, and achieve highly competitive performance for utility scoring on three real-world data sets citee id:123 citee title:towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences citee abstract:opinion question answering is a challenging task for natural language processing. in this paper, we discuss a necessary component for an opinion question answering system: separating opinions from fact, at both the document and sentence level. we present a bayesian classifier for discriminating between documents with a preponderance of opinions such as editorials from regular news stories, and describe three unsupervised, statistical techniques for the significantly harder task of detecting opinions at the sentence level. we also present a first model for classifying opinion sentences as positive or negative in terms of the main perspective being expressed in the opinion. results from a large collection of news stories and a human evaluation of 400 sentences are reported, indicating that we achieve very high performance in document classification (upwards of 97% precision and recall), and respectable performance in detecting opinions and classifying them at the sentence level as positive, negative, or neutral (up to 91% accuracy). surrounding text:a wide range of features were used, including new syntactic features developed for opinion recognition. yu and hatzivassiloglou [***]<2> approached the problem of separating opinions from fact, at both the document and sentence level. 
they presented a bayesian classifier for discriminating between documents with a preponderance of opinions such as editorials from regular (fact-based) news stories, and described three unsupervised statistical techniques for the task of detecting opinions at the sentence level influence:2 type:2 pair index:1007 citer id:836 citer title:review spam detection citer abstract:it is now a common practice for e-commerce web sites to enable their customers to write reviews of products that they have purchased. such reviews provide valuable sources of information on these products. they are used by potential customers to find opinions of existing users before deciding to purchase a product. they are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. in this paper, we make an attempt to study review spam and spam detection. to the best of our knowledge, there is still no reported study on this problem citee id:776 citee title:on the resemblance and containment of documents citee abstract:given two documents a and b we define two mathematical notions: their resemblance r(a; b) and their containment c(a; b) that seem to capture well the informal notions of "roughly the same" and "roughly contained." the basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. furthermore, the resemblance can be evaluated using a fixed size sample for each document. this paper discusses the mathematical properties of these measures and the efficient implementation of the sampling process using rabin fingerprints. surrounding text:for example, different userids posted duplicate or near duplicate reviews on the same product or different products. duplicate detection is done using the shingle method [***]<1> with similarity score > 0.9 influence:3 type:3 pair index:1008 citer id:836 citer title:review spam detection citer abstract:it is now a common practice for e-commerce web sites to enable their customers to write reviews of products that they have purchased. such reviews provide valuable sources of information on these products. they are used by potential customers to find opinions of existing users before deciding to purchase a product. they are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. in this paper, we make an attempt to study review spam and spam detection. to the best of our knowledge, there is still no reported study on this problem citee id:797 citee title:web spam taxonomy citee abstract:web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. recently, the amount of web spam has increased dramatically, leading to a degradation of search results. this paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures. surrounding text:spam reviews are very different as they give false opinions, which are much harder to detect even manually.
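The near-duplicate check described in the record above (the shingle method with a resemblance threshold of 0.9) can be illustrated directly. The sketch below computes the exact resemblance r(A,B) as the overlap of w-word shingle sets; the full method additionally uses fixed-size random samples (fingerprints) so that large collections can be compared efficiently, which is omitted here. The example reviews and the shingle width are illustrative.

def shingles(text, w=4):
    # set of contiguous w-token shingles of a document
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b, w=4):
    # r(A, B): size of the shingle intersection over size of the shingle union (Jaccard)
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

dup = "great camera , excellent pictures , fast shipping , highly recommended"   # same text posted again under another userid
other = "the battery died after two weeks and support never replied to my emails"
print(resemblance(dup, dup) > 0.9)     # True  -> flagged as duplicate reviews
print(resemblance(dup, other) > 0.9)   # False -> kept as distinct reviews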
thus, most existing methods for detecting web spam and email spam [***, 7, 9, 11]<2> are unsuitable for review spam. in this work, we study review spam influence:2 type:2 pair index:1009 citer id:836 citer title:review spam detection citer abstract:it is now a common practice for e-commerce web sites to enable their customers to write reviews of products that they have purchased. such reviews provide valuable sources of information on these products. they are used by potential customers to find opinions of existing users before deciding to purchase a product. they are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. in this paper, we make an attempt to study review spam and spam detection. to the best of our knowledge, there is still no reported study on this problem citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:it is now well recognized that such user generated contents on the web provide valuable information that can be exploited for many applications. in this paper, we focus on customer reviews of products, which contain information of consumer opinions on the products, and are useful to both potential customers and product manufacturers [***, 8]<3>. 
recently, there was a growing interest in mining opinions from reviews, however, the existing work is mainly on extracting positive and negative opinions using natural language processing techniques [e influence:2 type:3,2 pair index:1010 citer id:836 citer title:review spam detection citer abstract:it is now a common practice for e-commerce web sites to enable their customers to write reviews of products that they have purchased. such reviews provide valuable sources of information on these products. they are used by potential customers to find opinions of existing users before deciding to purchase a product. they are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. in this paper, we make an attempt to study review spam and spam detection. to the best of our knowledge, there is still no reported study on this problem citee id:560 citee title:fast statistical spam filter by approximate classifications citee abstract:statistical-based bayesian filters have become a popular and important defense against spam. however, despite their effectiveness, their greater processing overhead can prevent them from scaling well for enterprise-level mail servers. for example, the dictionary lookups that are characteristic of this approach are limited by the memory access rate, therefore relatively insensitive to increases in cpu speed. we address this scaling issue by proposing an acceleration technique that speeds up bayesian filters based on approximate classification. the approximation uses two methods: hash-based lookup and lossy encoding. lookup approximation is based on the popular bloom filter data structure with an extension to support value retrieval. lossy encoding is used to further compress the data structure. while both methods introduce additional errors to a strict bayesian approach, we show how the errors can be both minimized and biased toward a false negative classification.we demonstrate a 6x speedup over two well-known spam filters (bogofilter and qsf) while achieving an identical false positive rate and similar false negative rate to the original filters. surrounding text:spam reviews are very different as they give false opinions, which are much harder to detect even manually. thus, most existing methods for detecting web spam and email spam [3, ***, 9, 11]<2> are unsuitable for review spam. in this work, we study review spam influence:2 type:2 pair index:1011 citer id:836 citer title:review spam detection citer abstract:it is now a common practice for e-commerce web sites to enable their customers to write reviews of products that they have purchased. such reviews provide valuable sources of information on these products. they are used by potential customers to find opinions of existing users before deciding to purchase a product. they are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. in this paper, we make an attempt to study review spam and spam detection. 
to the best of our knowledge, there is still no reported study on this problem citee id:409 citee title:detecting spam web pages through content analysis citee abstract:in this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. this paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. when combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%). surrounding text:spam reviews are very different as they give false opinions, which are much harder to detect even manually. thus, most existing methods for detecting web spam and email spam [3, 7, ***, 11]<2> are unsuitable for review spam. in this work, we study review spam influence:2 type:2 pair index:1012 citer id:836 citer title:review spam detection citer abstract:it is now a common practice for e-commerce web sites to enable their customers to write reviews of products that they have purchased. such reviews provide valuable sources of information on these products. they are used by potential customers to find opinions of existing users before deciding to purchase a product. they are also used by product manufacturers to identify problems of their products and to find competitive intelligence information about their competitors. unfortunately, this importance of reviews also gives good incentive for spam, which contains false positive or malicious negative opinions. in this paper, we make an attempt to study review spam and spam detection. to the best of our knowledge, there is still no reported study on this problem citee id:799 citee title:topical trustrank: using topicality to combat web spam citee abstract:web spam is behavior that attempts to deceive search engine ranking algorithms. trustrank is a recent algorithm that can combat web spam. however, trustrank is vulnerable in the sense that the seed set used by trustrank may not be sufficiently representative to cover well the different topics on the web. also, for a given seed set, trustrank has a bias towards larger communities. we propose the use of topical information to partition the seed set and calculate trust scores for each topic separately to address the above issues. a combination of these trust scores for a page is used to determine its ranking. experimental results on two large datasets show that our topical trustrank has a better performance than trustrank in demoting spam sites or pages. compared to trustrank, our best technique can decrease spam from the top ranked sites by as much as 43.1% surrounding text:spam reviews are very different as they give false opinions, which are much harder to detect even manually. thus, most existing methods for detecting web spam and email spam [3, 7, 9, ***]<2> are unsuitable for review spam. 
in this work, we study review spam influence:2 type:2 pair index:1013 citer id:841 citer title:sentiment analysis capturing favorability using natural language processing citer abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative. the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:features of expressions to be used for sentiment analysis such as collocations [12,14]<2> and adjectives [5]<2> . acquisition of sentiment expressions and their polarities from supervised corpora, in which favorability in each document is explicitly assigned manually, such as five stars in reviews [***]<2>, and unsupervised corpora, such as the www [13]<2>, in which no clue on sentiment polarity is available except for the textual content [4]<2> in all of this work, the level of natural language processing (nlp) was shallow. except for stemming and analysis of part of speech (pos), they simply analyze co-occurrences of expressions within a short distance [7,12]<2> or patterns [1]<2> that are typically used for information extraction [3,10]<3> to analyze the relationships among expressions. many of their applications aim to classify the whole document into positive or negative toward a subject of the document that is specified either explicitly or implicitly [1-2,11-13]<2>, and the subject of all of the sentiment expressions are assumed to be the same as the document subject. for example, the classification of a movie review into positive or negative [***,13]<2> assumes that all sentiment expressions in the review represent sentiments directly toward that movie, and expressions that violate this assumption (such as a negative comment about an actor even though the movie as a whole is considered to be excellent) confuse the judgment of the classification. 
on the contrary, by analyzing the relationships between sentiment expressions and subjects, we can make in-depth analyses on what is favored and what is not influence:1 type:2 pair index:1014 citer id:841 citer title:sentiment analysis capturing favorability using natural language processing citer abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative. the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles citee id:720 citee title:message understanding conference - 6: a brief history citee abstract:we have recently completed the sixth in a series of "message understanding conferences" which are designed to promote and evaluate research in information extraction. muc-6 introduced several innovations over prior mucs, most notably in the range of different tasks for which evaluations were conducted. we describe some of the motivations for the new format and briefly discuss some of the results of the evaluations surrounding text:acquisition of sentiment expressions and their polarities from supervised corpora, in which favorability in each document is explicitly assigned manually, such as five stars in reviews [2]<2>, and unsupervised corpora, such as the www [13]<2>, in which no clue on sentiment polarity is available except for the textual content [4]<2> in all of this work, the level of natural language processing (nlp) was shallow. except for stemming and analysis of part of speech (pos), they simply analyze co-occurrences of expressions within a short distance [7,12]<2> or patterns [1]<2> that are typically used for information extraction [***,10]<3> to analyze the relationships among expressions. analysis of relationships based on distance obviously has limitations influence:3 type:3 pair index:1015 citer id:841 citer title:sentiment analysis capturing favorability using natural language processing citer abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative. the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. 
by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles citee id:459 citee title:effects of adjective orientation and gradability on sentence subjectivity citee abstract:subjectivity is a pragmatic, sentence-level feature that has important implications for text processing applications such as information extraction and information retrieval. we study the effects of dynamic adjectives, semantically oriented adjectives, and gradable adjectives on a simple subjectivity classifier, and establish that they are strong predictors of subjectivity. a novel trainable method that statistically combines two indicators of gradability is presented and evaluated, complementing existing automatic techniques for assigning orientation labels. surrounding text:specifically, the focus items include the following: . features of expressions to be used for sentiment analysis such as collocations [12,14]<2> and adjectives [***]<2> . acquisition of sentiment expressions and their polarities from supervised corpora, in which favorability in each document is explicitly assigned manually, such as five stars in reviews [2]<2>, and unsupervised corpora, such as the www [13]<2>, in which no clue on sentiment polarity is available except for the textual content [4]<2> in all of this work, the level of natural language processing (nlp) was shallow
surrounding text:furthermore, in order to maintain robustness for noisy texts from various sources such as the www, we decided to use a shallow parsing framework that identifies phrase boundaries and their local dependencies in addition to pos tagging, instead of using a full parser that tries to identify the complete dependency structure among all of the terms. for pos tagging, we used a markov-model-based tagger essentially the same as the one described in [***]<1>. this tagger assigns a part of speech to text tokens based on the distribution probabilities of candidate pos labels for each word and the probability of a pos transition extracted from a training corpus influence:2 type:1 pair index:1017 citer id:841 citer title:sentiment analysis capturing favorability using natural language processing citer abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative. the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles citee id:842 citee title:the talent system: textract architecture and data model citee abstract:we present the architecture and data model for textract, a robust, scalable and configurable document analysis framework. textract has been engineered as a pipeline architecture, allowing for rapid prototyping and application development by freely mixing reusable, existing, language analysis plugins and custom, new, plugins with customizable functionality. we discuss design issues which arise from requirements of industrial strength efficiency and scalability, and which are further constrained by plugin interactions, both among themselves, and with a common data model comprising an annotation store, document vocabulary and a lexical cache. we exemplify some of these by focusing on a meta-plugin: an interpreter for annotation-based finite state transduction, through which many linguistic filters can be implemented as stand-alone plugins. the framework and component plugins have been extensively deployed in both research and industrial environments, for a broad range of text analysis and mining tasks surrounding text:at a yet higher level, clause boundaries can be marked, and even (nominal) arguments for (verb) predicates can be identified. these pos tagging and shallow parsing functionalities have been implemented using the talent system based on the textract architecture [***]<1>. 
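The tagging step described above (per-word distributions over candidate pos labels plus pos-transition probabilities learned from a training corpus) is essentially a hidden Markov model decoded with the Viterbi algorithm. The following is a minimal sketch of that idea, assuming a toy tagged corpus and add-one smoothing; it illustrates the general technique, not the Talent/Textract tagger referred to in the text.

```python
import math
from collections import defaultdict

def train_hmm(tagged_sentences):
    """Estimate P(tag_i | tag_{i-1}) and P(word | tag) counts from a tagged corpus.
    tagged_sentences: list of [(word, tag), ...]."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            prev = tag
    return trans, emit

def viterbi(words, trans, emit):
    """Return the most likely tag sequence under the estimated model (add-one smoothing)."""
    tags = list(emit.keys())
    vocab_sizes = {t: len(emit[t]) for t in tags}

    def log_emit(t, w):
        counts = emit[t]
        return math.log((counts.get(w.lower(), 0) + 1) / (sum(counts.values()) + vocab_sizes[t] + 1))

    def log_trans(p, t):
        row = trans[p]
        return math.log((row.get(t, 0) + 1) / (sum(row.values()) + len(tags)))

    # chart[i][t] = (best log prob of tagging words[:i+1] ending in tag t, backpointer)
    chart = [{t: (log_trans("<s>", t) + log_emit(t, words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        row = {}
        for t in tags:
            best_prev = max(tags, key=lambda p: chart[i - 1][p][0] + log_trans(p, t))
            row[t] = (chart[i - 1][best_prev][0] + log_trans(best_prev, t) + log_emit(t, words[i]), best_prev)
        chart.append(row)
    # backtrace from the best final tag
    last = max(tags, key=lambda t: chart[-1][t][0])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        last = chart[i][last][1]
        seq.append(last)
    return list(reversed(seq))

# toy training data and a toy decode; both are illustrative assumptions
corpus = [[("the", "DT"), ("camera", "NN"), ("works", "VBZ"), ("well", "RB")],
          [("the", "DT"), ("lens", "NN"), ("is", "VBZ"), ("excellent", "JJ")]]
trans, emit = train_hmm(corpus)
print(viterbi(["the", "camera", "is", "excellent"], trans, emit))
```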
after obtaining the results of the shallow parser, we analyze the syntactic dependencies among the phrases and look for phrases with a sentiment term that modifies or is modified by a subject term influence:3 type:3 pair index:1018 citer id:841 citer title:sentiment analysis capturing favorability using natural language processing citer abstract:this paper illustrates a sentiment analysis approach to extract sentiments associated with polarities of positive or negative for specific subjects from a document, instead of classifying the whole document into positive or negative. the essential issues in sentiment analysis are to identify how sentiments are expressed in texts and whether the expressions indicate positive (favorable) or negative (unfavorable) opinions toward the subject. in order to improve the accuracy of the sentiment analysis, it is important to properly identify the semantic relationships between the sentiment expressions and the subject. by applying semantic analysis with a syntactic parser and sentiment lexicon, our prototype system achieved high precision (75-95%, depending on the data) in finding sentiments within web pages and news articles citee id:625 citee title:identifying collocations for recognizing opinions citee abstract:subjectivity in natural language refers to aspects of language used to express opinions and evaluations surrounding text:specifically, the focus items include the following: . features of expressions to be used for sentiment analysis such as collocations [12,***]<2> and adjectives [5]<2> . acquisition of sentiment expressions and their polarities from supervised corpora, in which favorability in each document is explicitly assigned manually, such as five stars in reviews [2]<2>, and unsupervised corpora, such as the www [13]<2>, in which no clue on sentiment polarity is available except for the textual content [4]<2> in all of this work, the level of natural language processing (nlp) was shallow influence:2 type:2 pair index:1019 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:568 citee title:finding parts in very large corpora citee abstract:we present a method for extracting parts of objects from wholes (e.g. "speedometer" from "car"). given a very large corpus our method finds part words with 55% accuracy for the top 50 words as ranked by the system. the part list could be scanned by an end-user and added to an existing ontology (such as wordnet), or used as a part of a rough semantic lexicon surrounding text:4. previous work [***]<2> describes a procedure that aims at extracting part-of features, using possessive constructions and prepositional phrases, from news corpus. 
by contrast, we extract both part-of and attribute-of relations influence:3 type:3 pair index:1020 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:second, the association of the extracted sentiment to a specific topic is difficult. most statistical opinion extraction algorithms perform poorly in this respect as evidenced in [***]<0>. they either i) assume the topic of the document is known a priori, or ii) simply associate the opinion to a topic term co-existing in the same context influence:1 type:2 pair index:1021 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. 
the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:197 citee title:accurate methods for the statistics of surprise and coincidence citee abstract:much work has been done on the statistical analysis of text. in some cases reported in the literature, inappropriate statistical methods have been used, and statistical significance of results have not been addressed. in particular, asymptotic normality assumptions have often been used unjustifiably, leading to flawed results. this assumption of normal distribution limits the ability to analyze rare events. unfortunately rare events do make up a large fraction of real text. however, more applicable methods based on likelihood ratio tests are available that yield good results with relatively small samples. these tests can be implemented efficiently, and have been used for the detection of composite terms and for the determination of domain-specific terms. in some cases, these measures perform much better than the methods previously used. in cases where traditional contingency table methods work well, the likelihood ratio tests described here are nearly identical. this paper describes the basis of a measure based on likelihood ratios that can be applied to the analysis of text surrounding text:likelihood test. this method is based on the likelihood-ratio test by dunning [***]<2>. let d+ be a collection of documents focused on a topic t , d influence:2 type:1 pair index:1022 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database.
the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:417 citee title:direction-based text interpretation as an information access refinement citee abstract:a text-based intelligent system should provide more in-depth information about the contents of its corpus than does a standard information retrieval system, while at the same time avoiding the complexity and resource-consuming behavior of detailed text understanders. instead of focusing on discovering documents that pertain to some topic of interest to the user, an approach is introduced based on the criterion of directionality (e.g., is the agent in favor of, neutral, or opposed to the event?). a method is described for coercing sentence meanings into a metaphoric model such that the only semantic interpretation needed in order to determine the directionality of a sentence is done with respect to the model. this interpretation method is designed to be an integrated component of a hybrid information access system. surrounding text:timations, they can be costly especially if a large volume of survey data is gathered.
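The likelihood test described above can be made concrete with Dunning's log-likelihood-ratio statistic computed from a 2 x 2 table of a candidate term's counts in the on-topic collection d+ and the off-topic collection d-. A minimal sketch, with toy counts assumed for illustration:

```python
import math

def _ll(k, n, p):
    # log-likelihood of observing k successes in n Bernoulli trials with rate p
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(c11, c12, c21, c22):
    """Dunning-style G^2 statistic for a 2x2 table:
    c11/c12 = candidate term counts in D+ / D-,
    c21/c22 = counts of all other terms in D+ / D-."""
    n1, n2 = c11 + c21, c12 + c22          # total tokens in D+ and D-
    p = (c11 + c12) / (n1 + n2)            # pooled rate under the null hypothesis
    p1, p2 = c11 / n1, c12 / n2            # separate rates under the alternative
    return 2.0 * (_ll(c11, n1, p1) + _ll(c12, n2, p2)
                  - _ll(c11, n1, p) - _ll(c12, n2, p))

# toy example: a term occurs 40 times in 10,000 D+ tokens vs 5 times in 20,000 D- tokens
print(log_likelihood_ratio(40, 5, 10000 - 40, 20000 - 5))
```

Candidate terms with the highest scores would then be kept as topic-specific features, which matches the ranking role the test plays in the description above.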
there has been extensive research on automatic text analysis for sentiment, such as sentiment classifiers[13, ***, 16, 2, 19]<2>, affect analysis[17, 21]<2>, automatic survey analysis[8, 16]<2>, opinion extraction[12]<2>, or recommender systems [18]<2>. these methods typically try to extract the overall sentiment revealed in a document, either positive or negative, or somewhere in between. [22]<2> identifies subjective adjectives (or sentiment adjectives) from corpora. past work on sentiment-based categorization of entire documents has often involved either the use of models inspired by cognitive linguistics [***, 16]<2> or the manual or semimanual construction of discriminant-word lexicons [2, 19]<2>. [***]<2> proposed a sentence interpretation model that attempts to answer directional queries based on the deep argumentative structure of the document, but with no implementation detail or any experimental results. past work on sentiment-based categorization of entire documents has often involved either the use of models inspired by cognitive linguistics [***, 16]<2> or the manual or semimanual construction of discriminant-word lexicons [2, 19]<2>. [***]<2> proposed a sentence interpretation model that attempts to answer directional queries based on the deep argumentative structure of the document, but with no implementation detail or any experimental results. [13]<2> compares three machine learning methods (naive bayes, maximum entropy classification, and svm) for sentiment classification task influence:2 type:2 pair index:1023 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:573 citee title:from sentence processing to information access on the world wide web citee abstract:this paper describes the start information server built at the mit artificial intelligence laboratory. available on the world wide web since december 1993, the start server provides users with access to multi-media information in response to questions formulated in english. over the last 7 years, the start server answered millions of questions from users all over the world. the start server is built on two foundations: the sentence-level natural language processing capability provided by the start natural language system (katz ) and the idea of natural language annotations for multi-media information segments. this paper starts with an overview of sentence-level processing in the start system and then explains how annotating information segments with collections of english sentences makes it possible to use the power of sentence-level natural language processing in the service of multi-media information access. 
the paper ends with a proposal to annotate the world wide web surrounding text:scope of sentiment analysis as a preprocessing step to our sentiment analysis, we extract sentences from input documents containing mentions of subject terms of interest. then, sa applies sentiment analysis to kernel sentences [***]<1> and some text fragments. kernel sentences usually contain only one verb. kernel sentences usually contain only one verb. for kernel sentences, sa extracts the following types of ternary expressions (t-expressions)[***]<1>: . positive or negative sentiment verbs: influence:2 type:3 pair index:1024 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:736 citee title:mining from open answers in questionnaire data citee abstract:surveys are important tools for marketing and for managing customer relationships; the answers to open-ended questions, in particular, often contain valuable information and provide an important basis for business decisions. the summaries that human analysts make of these open answers, however, tend to rely too much on intuition surrounding text:timations, they can be costly especially if a large volume of survey data is gathered. there has been extensive research on automatic text analysis for sentiment, such as sentiment classifiers[13, 6, 16, 2, 19]<2>, affect analysis[17, 21]<2>, automatic survey analysis[***, 16]<2>, opinion extraction[12]<2>, or recommender systems [18]<2>. these methods typically try to extract the overall sentiment revealed in a document, either positive or negative, or somewhere in between influence:3 type:2 pair index:1025 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. 
the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:570 citee title:foundations of statistical natural language processing citee abstract:statistical approaches to processing natural language text have become dominant in recent years. this foundational text is the first comprehensive introduction to statistical natural language processing (nlp) to appear. the book contains all the theory and algorithms needed for building nlp tools. it provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. the book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications. surrounding text:table 1. counts for a bnp [***]<3>: a 2 x 2 contingency table whose rows are the bnp itself and all other bnps and whose columns are d+ and d-, with cells c11, c12 (counts of the bnp in d+ and d-) and c21, c22 (counts of the other bnps in d+ and d-). then, the qi's are given by: qi = fi/δ - (λ/μ)·pi if 1 <= i <= t and qi = 0 otherwise (1), where δ = (Σ_{i=1..t} fi) / (1 + (λ/μ)·Σ_{i=1..t} pi). the following feature selection algorithm is the direct result of equation 1 influence:3 type:3,1 pair index:1026 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database.
the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:710 citee title:building a large annotated corpus of english: the penn treebank citee abstract:this paper, we review our experience with constructing one such large annotated corpus---the penn treebank, a corpus surrounding text:base noun phrases (bnp). bnp restricts the candidate feature terms to one of the following base noun phrase (bnp) patterns: nn, nn nn, jj nn, nn nn nn, jj nn nn, jj jj nn, where nn and jj are the part-of-speech (pos) tags for nouns and adjectives respectively defined by penn treebank[***]<3>. 2 influence:3 type:3 pair index:1027 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database.
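The base noun phrase (bnp) patterns listed above (nn, nn nn, jj nn, nn nn nn, jj nn nn, jj jj nn) can be matched directly against pos-tagged text to collect candidate feature terms. A minimal sketch, assuming the tagged input comes from some upstream pos tagger; the example sentence and helper names are illustrative:

```python
# Base-noun-phrase (BNP) patterns from the surrounding text, as POS-tag sequences.
BNP_PATTERNS = [("NN",), ("NN", "NN"), ("JJ", "NN"),
                ("NN", "NN", "NN"), ("JJ", "NN", "NN"), ("JJ", "JJ", "NN")]

def extract_bnps(tagged_tokens):
    """Return every span of tagged_tokens whose POS sequence matches a BNP pattern.
    tagged_tokens: list of (word, pos) pairs, e.g. from a POS tagger."""
    candidates = []
    n = len(tagged_tokens)
    for start in range(n):
        for pattern in BNP_PATTERNS:
            end = start + len(pattern)
            if end <= n and tuple(pos for _, pos in tagged_tokens[start:end]) == pattern:
                candidates.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return candidates

# hypothetical tagger output for "the digital camera has excellent picture quality"
tagged = [("the", "DT"), ("digital", "JJ"), ("camera", "NN"), ("has", "VBZ"),
          ("excellent", "JJ"), ("picture", "NN"), ("quality", "NN")]
print(extract_bnps(tagged))
# -> ['digital camera', 'camera', 'excellent picture', 'excellent picture quality',
#     'picture', 'picture quality', 'quality']
```

The extracted candidates would then be ranked by a statistic such as the likelihood ratio or the mixture-model weights qi from equation (1).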
the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:770 citee title:nouns in wordnet : a lexical inheritance system citee abstract:the prototypical definition of a noun consists of its immediate superordinate followed by a relative clause that describes how this instance differs from all other instances. this information provides a basis for organizing files in wordnet. wordnet is a lexical inheritance system. hyponyms are connected with their superordinates and vice versa. thus, the lexical database is a hierarchy that can be searched upward or downward with equal speed. surrounding text:- pos is the required pos tag of lexical entry. - sentiment_category : + the following is an example of the lexicon entry: "excellent" jj + we have collected sentiment words from several sources: general inquirer (gi)1, dictionary of affect of language (dal)2[21]<1>, and wordnet[***]<1>. from gi, we extracted all words in positive, negative, and hostile categories influence:3 type:3 pair index:1028 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:timations, they can be costly especially if a large volume of survey data is gathered. there has been extensive research on automatic text analysis for sentiment, such as sentiment classifiers[***, 6, 16, 2, 19]<2>, affect analysis[17, 21]<2>, automatic survey analysis[8, 16]<2>, opinion extraction[12]<2>, or recommender systems [18]<2>. these methods typically try to extract the overall sentiment revealed in a document, either positive or negative, or somewhere in between. [6]<2> proposed a sentence interpretation model that attempts to answer directional queries based on the deep argumentative structure of the document, but with no implementation detail or any experimental results. 
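The sentiment lexicon entries quoted above pair a term with a required pos tag and a polarity (e.g. "excellent" jj +). A minimal lookup sketch, assuming a tiny hand-made lexicon; all entries other than the quoted one, and the function name, are illustrative assumptions rather than the gi/dal/wordnet-derived lexicon itself:

```python
# sentiment lexicon: (lemma, required POS tag) -> polarity, mirroring entries like: "excellent" JJ +
SENTIMENT_LEXICON = {
    ("excellent", "JJ"): "+",   # example entry quoted in the text
    ("good", "JJ"): "+",        # assumed additional entries for illustration
    ("poor", "JJ"): "-",
    ("fail", "VB"): "-",
}

def lookup_sentiments(tagged_tokens):
    """Return (word, polarity) for every token whose (lemma, POS) pair is in the lexicon."""
    hits = []
    for word, pos in tagged_tokens:
        polarity = SENTIMENT_LEXICON.get((word.lower(), pos))
        if polarity is not None:
            hits.append((word, polarity))
    return hits

print(lookup_sentiments([("The", "DT"), ("zoom", "NN"), ("is", "VBZ"), ("excellent", "JJ")]))
# -> [('excellent', '+')]
```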
[***]<2> compares three machine learning methods (naive bayes, maximum entropy classification, and svm) for sentiment classification task. [20]<2> used the average semantic orientation of the phrases in the review influence:2 type:2 pair index:1029 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:127 citee title:a maximum entropy model for part-ofspeech tagging citee abstract:this paper presents a statistical model which trains from a corpus annotated with part-ofspeech tags and assigns them to previously unseen text with state-of-the-art accuracy(96.6%). the model can be classified as a maximum entropy model and simultaneously uses many contextual "features" to predict the pos tag. furthermore, this paper demonstrates the use of specialized features to model difficult tagging decisions, discusses the corpus consistency problems discovered during the implementation surrounding text:top 20 feature terms extracted by bbnp-l in the order of their rank list are more likely to be feature terms. we used the ratnaparkhi pos tagger[***]<1> to extract bnps. = 0 influence:3 type:3 pair index:1030 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:507 citee title:emotion and style in 30-second television advertisements targeted at men, women, boys, and girls citee abstract:a program for objective textual analysis which incorporated measures of style, word emotionality, and word imagery, was used to score the verbal portion of 152 30-sec. television advertisements. this analysis indicated that advertisements directed at children were more active, longer, and less negative than those directed at adults. 
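The classifier comparison mentioned at the start of this record (naive bayes, maximum entropy, and svm for polarity classification) can be illustrated with the simplest of the three, a bag-of-words naive bayes model; the toy training snippets and labels below are assumptions for illustration, not the data or code of the cited work:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# toy labeled snippets standing in for review sentences
train_texts = ["the picture quality is excellent", "battery life is great",
               "the lens is terrible", "poor build and it failed quickly"]
train_labels = ["positive", "positive", "negative", "negative"]

# bag-of-words features + multinomial naive bayes
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

print(clf.predict(["the battery life is excellent", "the lens is terrible and poor"]))
# -> ['positive' 'negative'] on this toy data
```

Maximum entropy (logistic regression) or an svm could be swapped into the same pipeline, which is essentially the comparison setup the cited work evaluates.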
a comparison of advertisements directed at males and females regardless of age showed greater linguistic complexity (more words, fewer common words) when their text was directed at women and girls. each of the 13 stylistic and emotional measures used to describe advertisements produced at least one significant difference associated with the age or the sex of the target population or their interaction surrounding text:[20]<2> used the average semantic orientation of the phrases in the review. [***]<2> analysed emotional affect of various corpora computed as average of affect scores of individual affect terms in the articles. the sentiment classifiers often assumes 1) each document has only one subject, and 2) the subject of each document is known influence:3 type:3 pair index:1031 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:201 citee title:affect analysis of text using fuzzy semantic typing citee abstract:we propose a novel, convenient fusion of natural-language processing and fuzzy logic techniques for analyzing affect content in free text; our main goals are fast analysis and visualization of affect content for decision-making. the primary linguistic resource for fuzzy semantic typing is the fuzzy affect lexicon, from which other important resources are generated, notably the fuzzy thesaurus and affect category groups. free text is tagged with affect categories from the lexicon, and the affect surrounding text:timations, they can be costly especially if a large volume of survey data is gathered. there has been extensive research on automatic text analysis for sentiment, such as sentiment classifiers[13, 6, 16, 2, 19]<2>, affect analysis[***, 21]<2>, automatic survey analysis[8, 16]<2>, opinion extraction[12]<2>, or recommender systems [18]<2>. these methods typically try to extract the overall sentiment revealed in a document, either positive or negative, or somewhere in between influence:3 type:3 pair index:1032 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. 
sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:779 citee title:phoaks: a system for sharing recommendations citee abstract:finding relevant, high-quality information on the world-wide web is a difficult problem. phoaks (people helping one another know stuff) is an experimental system that addresses this problem through a collaborative filtering approach. phoaks works by automatically recognizing, tallying, and redistributing recommendations of web resources mined from usenet news messages. surrounding text:timations, they can be costly especially if a large volume of survey data is gathered. there has been extensive research on automatic text analysis for sentiment, such as sentiment classifiers[13, 6, 16, 2, 19]<2>, affect analysis[17, 21]<2>, automatic survey analysis[8, 16]<2>, opinion extraction[12]<2>, or recommender systems [***]<2>. these methods typically try to extract the overall sentiment revealed in a document, either positive or negative, or somewhere in between influence:3 type:3 pair index:1033 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents.
instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:742 citee title:model-based feedback in the language modeling approach to information retrieval citee abstract:the language modeling approach to retrieval has been shown to perform well empirically. one advantage of this new approach is its statistical foundations. however, feedback, as one important component in a retrieval system, has only been dealt with heuristically in this new retrieval approach: the original query is usually literally expanded by adding additional terms to it. such expansion-based feedback creates an inconsistent interpretation of the original and the expanded query. in this paper, we present a more principled approach to feedback in the language modeling approach. specifically, we treat feedback as updating the query language model based on the extra evidence carried by the feedback documents. such a model-based feedback strategy easily fits into an extension of the language modeling approach. we propose and evaluate two different approaches to updating a query language model based on feedback documents, one based on a generative probabilistic model of feedback documents and one based on minimization of the kl-divergence over feedback documents. experiment results show that both approaches are effective and outperform the rocchio feedback approach surrounding text:mixture model. this method is based on the mixture language model by zhai and lafferty [***]<2>: they assume that an observed document d is generated by a mixture of the query model and the corpus language model. in our case, we may consider our language model as the mixture (or a linear combination) of the general web language model θ_W (similar to the corpus language model) and a topic-specific language model θ_T (similar to the query model): θ = λ·θ_W + μ·θ_T, where λ, μ are given and sum to 1 influence:2 type:1 pair index:1035 citer id:846 citer title:sentiment analyzer extractingsentiments about a given topic citer abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database.
the performance of the algorithms was verified on online product review articles (digital camera and music reviews), and more general documents including general webpages and news articles citee id:532 citee title:exact maximum likelihood estimation for word mixtures citee abstract:the mixture model for generating document is a generative language model used in information retrieval. while using this model, there are situations that we need to find the maximum likelihood estimation of the density of one multinomial, given fixed mixture weight and the densities of the other multinomial surrounding text:the problem of finding θ_T can be generalized as finding the maximum likelihood estimation of the multinomial distribution θ_T. zhang et al. [***]<2> developed an o(k log k) algorithm that computes the exact maximum likelihood estimation of the multinomial distribution of q in the following mixture model of multinomial distributions, p = (p1, p2, . influence:2 type:1 pair index:1036 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously.
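The exact maximum likelihood estimation referenced above can be sketched for the two-component mixture θ = λ·θ_W + μ·θ_T: sort terms by the ratio of observed frequency to background probability, keep the largest prefix whose closed-form estimates stay positive, and set the remaining probabilities to zero, which costs o(k log k) because of the sort. This is a generic reconstruction of that style of estimator under the stated mixture, not a transcription of the cited algorithm; the function name and toy numbers are assumptions:

```python
def exact_mixture_mle(freqs, background, lam, mu):
    """Exact MLE of the topic model q in the mixture  lam*background + mu*q  (lam + mu = 1).

    freqs: observed relative frequencies f_i of the collection (sum to 1).
    background: background (web/corpus) probabilities p_i, assumed > 0.
    Returns the estimated q as a list; a sketch of the sort-and-threshold style of estimator."""
    k = len(freqs)
    order = sorted(range(k), key=lambda i: freqs[i] / background[i], reverse=True)
    ratio = lam / mu

    # find the largest prefix of the sorted terms whose closed-form estimates stay positive
    best_m, best_eta = 0, None
    f_sum = p_sum = 0.0
    for m, i in enumerate(order, start=1):
        f_sum += freqs[i]
        p_sum += background[i]
        eta = f_sum / (1.0 + ratio * p_sum)
        if freqs[i] / background[i] > ratio * eta:
            best_m, best_eta = m, eta

    q = [0.0] * k
    for i in order[:best_m]:
        q[i] = freqs[i] / best_eta - ratio * background[i]
    return q

# toy vocabulary: the first two terms look topic-specific, the rest look like background
freqs      = [0.30, 0.25, 0.20, 0.15, 0.10]
background = [0.05, 0.05, 0.30, 0.30, 0.30]
q = exact_mixture_mle(freqs, background, lam=0.5, mu=0.5)
print([round(x, 3) for x in q], "sum =", round(sum(q), 3))  # q concentrates on the first terms and sums to 1
```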
the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:639 citee title:identifying sources of opinions with conditional random fields and extraction patterns citee abstract:recent systems have been developed for sentiment classification, opinion recognition, and opinion analysis (e.g., detecting polarity and strength). we pursue another aspect of opinion analysis: identifying the sources of opinions, emotions, and sentiments. we view this problem as an information extraction task and adopt a hybrid approach that combines conditional random fields (lafferty et al., 2001) and a variation of autoslog (riloff, 1996a). while crfs model source identification as a sequence tagging task, autoslog learns extraction patterns. our results show that the combination of these two methods performs better than either one alone. the resulting system identifies opinion sources with 79.3% precision and 59.5% recall using a head noun matching measure, and 81.2% precision and 60.6% recall using an overlap measure surrounding text:g. , [26, ***]<2>). the most common definition of the problem is a binary classification task of a sentence to either the positive or the negative polarity [23, 21]<2> influence:2 type:2 pair index:1038 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:167 citee title:the predictive power of online chatter citee abstract:an increasing fraction of the global discourse is migrating online in the form of blogs, bulletin boards, web pages, wikis, editorials, and a dizzying array of new collaborative technologies. 
the migration has now proceeded to the point that topics reflecting certain individual products are sufficiently popular to allow targeted online tracking of the ebb and flow of chatter around these topics. based on an analysis of around half a million sales rank values for 2,340 books over a period of four months, and correlating postings in blogs, media, and web pages, we are able to draw several interesting conclusions. first, carefully hand-crafted queries produce matching postings whose volume predicts sales ranks. second, these queries can be automatically generated in many cases. and third, even though sales rank motion might be difficult to predict in general, algorithmic predictors can use online postings to successfully predict spikes in sales rank. surrounding text:technically, the task of mining user opinions from weblogs boils down to sentiment analysis of blog data - identifying and extracting positive and negative opinions from blog articles. although much work has been done recently on blog mining [11, 7, ***, 15]<2>, most existing work aims at extracting and analyzing topical contents of blog articles without any sentiment analysis. the lack of sentiment analysis in such work often limits the effectiveness of the mining results. for example, in [***]<2>, a burst of blog mentions about a book has been shown to be correlated with a spike of sales of the book in amazon.com; however, a burst of criticism of a book is unlikely to indicate a growth of the book sales. e. , topic life cycles and corresponding sentiment dynamics) could potentially provide more in-depth understanding of the public opinions than [20]<2>, and yield more accurate predictions of user behavior than using the methods proposed in [***]<2> and [19]<2>. to achieve this goal, we can approximate these temporal patterns by partitioning documents into their corresponding time periods and computing the posterior probability of p(tj), p(tj , p ) and p(tj , n), where t is a time period. however, there are several lines of related work. weblogs have been attracting increasing attentions from researchers, who consider weblogs as a suitable test bed for many novel research problems and algorithms [11, 7, ***, 15, 19]<2>. much new research work has found applications to weblog analysis, such as community evolution [11]<2>, spatiotemporal text mining [15]<2>, opinion tracking [20, 15, 19]<2>, information propagation [7]<2>, and user behavior prediction [***]<2>. mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2>. mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2>.
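The temporal analysis described above (partitioning documents into time periods and measuring how strongly each topic and its positive/negative sentiment shows up in each period) can be approximated with simple normalized counts once each document has been assigned a period, a topic, and a sentiment label by some upstream model. The field names and toy records below are illustrative assumptions, not the tsm estimation procedure itself:

```python
from collections import Counter, defaultdict

def sentiment_dynamics(docs):
    """docs: iterable of dicts with 'period', 'topic', and 'sentiment' in {'pos', 'neg', 'neutral'}.
    Returns, per time period, the share of documents about each topic and the share of
    positive/negative documents within that topic - a crude stand-in for the posterior curves."""
    per_period = defaultdict(Counter)        # period -> Counter of topics
    per_topic_sent = defaultdict(Counter)    # (period, topic) -> Counter of sentiments
    for d in docs:
        per_period[d["period"]][d["topic"]] += 1
        per_topic_sent[(d["period"], d["topic"])][d["sentiment"]] += 1

    dynamics = {}
    for period, topic_counts in per_period.items():
        total = sum(topic_counts.values())
        dynamics[period] = {
            topic: {
                "topic_share": count / total,
                "pos_share": per_topic_sent[(period, topic)]["pos"] / count,
                "neg_share": per_topic_sent[(period, topic)]["neg"] / count,
            }
            for topic, count in topic_counts.items()
        }
    return dynamics

docs = [
    {"period": "2005-11", "topic": "battery", "sentiment": "neg"},
    {"period": "2005-11", "topic": "battery", "sentiment": "pos"},
    {"period": "2005-11", "topic": "price", "sentiment": "pos"},
    {"period": "2005-12", "topic": "battery", "sentiment": "neg"},
]
print(sentiment_dynamics(docs)["2005-11"]["battery"])
# -> {'topic_share': 0.666..., 'pos_share': 0.5, 'neg_share': 0.5}
```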
gruhl and others [7]<2> proposed a model for information propagation and detect spikes in the diffusing topics in weblogs, and later use the burst of blog mentions to predict spikes of sales of this book in the near future [***]<2>. however, all these models tend to ignore the sentiments in the weblogs, and only capture the general influence:2 type:2 pair index:1039 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:168 citee title:information diffusion through blogspace citee abstract:we study the dynamics of information propagation in environments of low-overhead personal publishing, using a large collection of weblogs over time as our example domain. we characterize and model this collection at two levels. first, we present a macroscopic characterization of topic propagation through our corpus, formalizing the notion of long-running "chatter" topics consisting recursively of "spike" topics generated by outside world events, or more rarely, by resonances within the community. second, we present a microscopic characterization of propagation from individual to individual, drawing on the theory of infectious diseases to model the flow. we propose, validate, and employ an algorithm to induce the underlying propagation network from a sequence of posts, and report on the results. surrounding text:technically, the task of mining user opinions from weblogs boils down to sentiment analysis of blog data - identifying and extracting positive and negative opinions from blog articles. although much work has been done recently on blog mining [11, ***, 6, 15]<2>, most existing work aims at extracting and analyzing topical contents of blog articles without any sentiment analysis. however, there are several lines of related work. weblogs have been attracting increasing attentions from researchers, who consider weblogs as a suitable test bed for many novel research problems and algorithms [11, ***, 6, 15, 19]<2>. much new research work has found applications to weblog analysis, such as community evolution [11]<2>, spatiotemporal text mining [15]<2>, opinion tracking [20, 15, 19]<2>, information propagation [***]<2>, and user behavior prediction [6]<2>.
weblogs have been attracting increasing attentions from researchers, who consider weblogs as a suitable test bed for many novel research problems and algorithms [11, ***, 6, 15, 19]<2>. much new research work has found applications to weblog analysis, such as community evolution [11]<2>, spatiotemporal text mining [15]<2>, opinion tracking [20, 15, 19]<2>, information propagation [***]<2>, and user behavior prediction [6]<2>. mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2>. mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2>. gruhl and others [***]<2> proposed a model for information propagation and detect spikes in the diffusing topics in weblogs, and later use the burst of blog mentions to predict spikes of sales of this book in the near future [6]<2>. however, all these models tend to ignore the sentiments in the weblogs, and only capture the general influence:2 type:2,3 pair index:1040 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:303 citee title:clustering versus faceted categories for information exploration citee abstract:information seekers often express a desire for a user interface that organizes search results into meaningful groups, in order to help make sense of the results, and to help decide what to do next. a longitudinal study in which participants were provided with the ability to group search results found they changed their search habits in response to having the grouping mechanism available. there are many open research questions about how to generate useful groupings and how to design interfaces to support exploration using grouping. currently two methods are quite popular: clustering and faceted categorization. here, i describe both approaches and summarize their advantages and disadvantages based on the results of usability studies. surrounding text:opinmind [20]<2> summarizes the weblog search results with positive and negative categories. on the other hand, researchers also use facets to categorize the latent topics in search results [***]<2>.
however, all this work ignores the correlation between topics and sentiments influence:3 type:2 pair index:1041 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:427 citee title:probabilistic latent semantic indexing citee abstract:probabilistic latent semantic indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data. fitted from a training corpus of text documents by a generalization of the expectation maximization algorithm, the utilized model is able to deal with domain-specific synonymy as well as with polysemous words. in contrast to standard latent semantic indexing (lsi) by singular value decomposition, the probabilistic variant has a solid statistical foundation and defines a proper generative data model. retrieval experiments on a number of test collections indicate substantial performance gains over direct term matching methods as well as over lsi. in particular, the combination of models with different dimensionalities has proven to be advantageous surrounding text:we assume that c covers a number of topics, or subtopics (also known as themes) and some related sentiments. following [***, 1, 16, 17]<1>, we further assume that there are k major topics (subtopics) in the documents, {θ1, θ2, ... influence:1 type:1 pair index:1042 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections.
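several of the surrounding contexts above build on plsa-style mixtures of unigram language models estimated with em (k topics, each a multinomial over the vocabulary). the sketch below is a minimal, self-contained illustration of that general estimation scheme only; the toy counts and variable names are assumptions, and it is not the code of any cited paper.

```python
# Minimal PLSA-style EM sketch for a mixture of k unigram language models.
# Illustrative only: the toy corpus and all names are assumptions.
import numpy as np

def plsa_em(counts, k, n_iter=50, seed=0):
    """counts: (n_docs, n_words) term counts; returns p(w|topic), p(topic|doc)."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((k, n_words)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, k));  p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior p(z | d, w), shape (n_docs, k, n_words)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate the multinomials from expected counts
        expected = counts[:, None, :] * post
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

# toy usage: 4 documents over a 6-word vocabulary, 2 latent topics
toy = np.array([[3, 2, 0, 0, 1, 0],
                [4, 1, 1, 0, 0, 0],
                [0, 0, 3, 4, 0, 1],
                [0, 1, 2, 3, 0, 2]])
p_w_z, p_z_d = plsa_em(toy, k=2)
```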
the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:864 citee title:viewing morphology as an inference process citee abstract:morphology is the area of linguistics concerned with the internal structure of words. information retrieval has generally not paid much attention to word structure, other than to account for some of the variability in word forms via the use of stemmers. this paper will describe our experiments to determine the importance of morphology, and the effect that it has on performance. we will also describe the role of morphological analysis in word sense disambiguation, and in identifying lexical surrounding text:table 2 (basic statistics of the test data sets): data set ipod, # doc. 2988, time period 1/11/05-11/01/06, query term ipod; data set da vinci code, # doc. 1000, time period 1/26/05-10/31/06, query term da+vinci+code. for all the weblog collections, the krovetz stemmer [***]<1> is used to stem the text. influence:3 type:3 pair index:1043 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:172 citee title:on the bursty evolution of blogspace citee abstract:we propose two new tools to address the evolution of hyperlinked corpora. first, we define time graphs to extend the traditional notion of an evolving directed graph, capturing link creation as a point phenomenon in time. second, we develop definitions and algorithms for time-dense community tracking, to crystallize the notion of community evolution. we develop these tools in the context of blogspace, the space of weblogs (or blogs). our study involves approximately 750k links among 25k blogs. we create a time graph on these blogs by an automatic analysis of their internal time stamps. we then study the evolution of connected component structure and microscopic community structure in this time graph. we show that blogspace underwent a transition behavior around the end of 2001, and has been rapidly expanding over the past year, not just in metrics of scale, but also in metrics of community structure and connectedness. this expansion shows no sign of abating, although measures of connectedness must plateau within two years. by randomizing link destinations in blogspace, but retaining sources and timestamps, we introduce a concept of randomized blogspace.
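the surrounding context above notes that the krovetz stemmer is applied to the weblog text before model estimation. a minimal preprocessing sketch is shown below; since the original krovetz implementation is not assumed to be available, nltk's PorterStemmer is used as a stand-in, which is a different (more aggressive) stemmer than the one the paper reports using.

```python
# Preprocessing sketch: lowercase, tokenize, stop-word filter, stem.
# NLTK's PorterStemmer is a stand-in for the Krovetz stemmer mentioned in the
# surrounding text; it is not the same algorithm.
import re
from nltk.stem import PorterStemmer

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "is", "in", "it"}  # toy list
stemmer = PorterStemmer()

def preprocess(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The batteries of the iPod died quickly."))
# e.g. ['batteri', 'ipod', 'die', 'quickli']
```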
herein, we observe similar evolution of a giant component, but no corresponding increase in community structure. having demonstrated the formation of micro-communities over time, we then turn to the ongoing activity within active communities. we extend recent work of kleinberg to discover dense periods of "bursty" intra-community link creation. surrounding text:technically, the task of mining user opinions from weblogs boils down to sentiment analysis of blog data, i.e., identifying and extracting positive and negative opinions from blog articles. although much work has been done recently on blog mining [***, 7, 6, 15]<2>, most existing work aims at extracting and analyzing topical contents of blog articles without any ... however, there are several lines of related work. weblogs have been attracting increasing attention from researchers, who consider weblogs as a suitable test bed for many novel research problems and algorithms [***, 7, 6, 15, 19]<2>. much new research work has found applications to weblog analysis, such as community evolution [***]<2>, spatiotemporal text mining [15]<2>, opinion tracking [20, 15, 19]<2>, information propagation [7]<2>, and user behavior prediction [6]<2>. mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2> influence:2 type:2 pair index:1044 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:815 citee title:pachinko allocation: dag-structured mixture models of topic correlations citee abstract:latent dirichlet allocation (lda) and other related topic models are increasingly popular tools for summarization and manifold discovery in discrete data. however, lda does not capture correlations between topics.
in this paper, we introduce the pachinko allocation model (pam), which captures arbitrary, nested, and possibly sparse correlations between topics using a directed acyclic graph (dag). the leaves of the dag represent individual words in the vocabulary, while each interior node represents a correlation among its children, which may be words or other interior nodes (topics). pam provides a flexible alternative to recent work by blei and lafferty (2006), which captures correlations only between pairs of topics. using text data from newsgroups, historic nips proceedings and other research paper corpora, we show improved performance of pam in document classification, likelihood of held-out data, the ability to support finer-grained topics, and topical keyword coherence. surrounding text:a mixture model for theme and sentiment analysis, 3.1 the generation process: a lot of previous work has shown the effectiveness of mixture of multinomial distributions (mixture language models) in extracting topics (themes, subtopics) from either plain text collections or contextualized collections [9, 1, 16, 15, 17, ***]<2>. however, none of this work models topics and sentiments simultaneously. they also did not provide a way to model sentiment dynamics. there is yet another line of research in text mining, which tries to model the mixture of topics (themes) in documents [9, 1, 16, 15, 17, ***]<2>. the mixture model we presented is along this line. however, we do notice that the tsm model is a special case of some very general topic models, such as the cplsa model [17]<2>, which mixes themes with different views (topic, sentiment) and different coverages (sentiment coverages). the generation structure in figure 2 is also related to the general dag structure presented in [***]<2>. influence:3 type:2 pair index:1045 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:114 citee title:opinion observer: analyzing and comparing opinions on the web citee abstract:the web has become an excellent source for gathering consumer opinions. there are now numerous web sites containing such opinions, e.g., customer reviews of products, forums, discussion groups, and blogs. this paper focuses on online customer reviews of products. it makes two contributions. first, it proposes a novel framework for analyzing and comparing consumer opinions of competing products.
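the generation process described in the surrounding text above draws each word either from a background model or from one of k topic models, with topical words further split among a neutral topic model and positive/negative sentiment models. the snippet below is only a toy sampler for that kind of mixture; the distributions and coverage weights are invented for illustration and are not the parameters of the cited model.

```python
# Toy sampler for a background / topic / sentiment word mixture
# (illustrative distributions only; not estimated from any real collection).
import random

background = {"the": 0.5, "and": 0.3, "of": 0.2}
topics = [{"battery": 0.6, "screen": 0.4}, {"price": 0.7, "shipping": 0.3}]
pos_model = {"love": 0.5, "awesome": 0.5}
neg_model = {"hate": 0.5, "broken": 0.5}

lambda_b = 0.3                                # probability of the background model
pi_d = [0.6, 0.4]                             # topic coverage of one document
delta = [(0.5, 0.3, 0.2), (0.4, 0.2, 0.4)]    # (neutral, positive, negative) per topic

def draw(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

def sample_word():
    if random.random() < lambda_b:
        return draw(background)
    j = random.choices(range(len(topics)), weights=pi_d)[0]
    f, p, n = delta[j]
    model = random.choices([topics[j], pos_model, neg_model], weights=[f, p, n])[0]
    return draw(model)

print(" ".join(sample_word() for _ in range(15)))
```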
a prototype system called opinion observer is also implemented. the system is such that with a single glance of its visualization, the user is able to clearly see the strengths and weaknesses of each product in the minds of consumers in terms of various product features. this comparison is useful to both potential customers and product manufacturers. for a potential customer, he/she can see a visual side-by-side and feature-by-feature comparison of consumer opinions on these products, which helps him/her to decide which product to buy. for a product manufacturer, the comparison enables it to easily gather marketing intelligence and product benchmarking information. second, a new technique based on language pattern mining is proposed to extract product features from pros and cons in a particular type of reviews. such features form the basis for the above comparison. experimental results show that the technique is highly effective and outperform existing methods significantly. surrounding text:for example, a user may like the price and fuel efficiency of a new toyota camry, but dislike its power and safety aspects. indeed, people tend to have different opinions about different features of a product [28, ***]<2>. as another example, a voter may agree with some points made by a presidential candidate, but disagree with some others influence:2 type:3,2 pair index:1046 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:151 citee title:a probabilistic approach to spatiotemporal theme pattern mining on weblogs citee abstract:mining subtopics from weblogs and analyzing their spatiotemporal patterns have applications in multiple domains. in this paper, we define the novel problem of mining spatiotemporal theme patterns from weblogs and propose a novel probabilistic approach to model the subtopic themes and spatiotemporal theme patterns simultaneously. the proposed model discovers spatiotemporal theme patterns by (1) extracting common themes from weblogs; (2) generating theme life cycles for each given location; and (3) generating theme snapshots for each given time period. evolution of patterns can be discovered by comparative analysis of theme life cycles and theme snapshots. experiments on three different data sets show that the proposed approach can discover interesting spatiotemporal theme patterns effectively. 
the proposed probabilistic model is general and can be used for spatiotemporal text mining on any domain with time and location information. surrounding text:technically, the task of mining user opinions from weblogs boils down to sentiment analysis of blog data, i.e., identifying and extracting positive and negative opinions from blog articles. although much work has been done recently on blog mining [11, 7, 6, ***]<2>, most existing work aims at extracting and analyzing topical contents of blog articles without any ... first, it is not immediately clear how to model topics and sentiments simultaneously with a mixture model. no existing topic extraction work [9, 1, 16, ***, 17]<2> could extract sentiment models from text, while no sentiment classification algorithm could model a mixture of topics simultaneously. second, it is unclear how to obtain sentiment models that are independent of specific contents of topics and can be generally applicable to any collection representing a user's ad hoc information need. a mixture model for theme and sentiment analysis, 3.1 the generation process: a lot of previous work has shown the effectiveness of mixture of multinomial distributions (mixture language models) in extracting topics (themes, subtopics) from either plain text collections or contextualized collections [9, 1, 16, ***, 17, 12]<2>. however, none of this work models topics and sentiments simultaneously. to model both topics and sentiments, we also use a mixture of multinomials, but extend the model structure to include two sentiment models to naturally capture sentiments. in the previous work [***, 17]<2>, the words in a blog article are classified into two categories: (1) common english words (e.g. ... this approach has the limitation that these posterior distributions are not well defined, because the time variable t is nowhere involved in the original model. an alternative approach would be to model the time variable t explicitly in the model as in [***, 17]<2>, but this would bring in many more free parameters to the model, making it harder to estimate all the parameters reliably. defining a good partition of the time line is also a challenging problem, since too coarse a partition would miss many bursting patterns, while too fine a granularity may not be estimated reliably because of data sparseness. however, there are several lines of related work. weblogs have been attracting increasing attention from researchers, who consider weblogs as a suitable test bed for many novel research problems and algorithms [11, 7, 6, ***, 19]<2>. much new research work has found applications to weblog analysis, such as community evolution [11]<2>, spatiotemporal text mining [***]<2>, opinion tracking [20, ***, 19]<2>, information propagation [7]<2>, and user behavior prediction [6]<2>.
mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2>. weblogs have been attracting increasing attention from researchers, who consider weblogs as a suitable test bed for many novel research problems and algorithms [11, 7, 6, ***, 19]<2>. much new research work has found applications to weblog analysis, such as community evolution [11]<2>, spatiotemporal text mining [***]<2>, opinion tracking [20, ***, 19]<2>, information propagation [7]<2>, and user behavior prediction [6]<2>. mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2>. they also require sentiment training data for every topic, or manually input sentiment keywords, while we can learn general sentiment models applicable to ad hoc topics. most opinion extraction work tries to find general opinions on a given topic but does not distinguish sentiments [28, ***]<2>. liu and others extracted product features and opinion features for a product, thus were able to provide sentiments for different features of a product. they also did not provide a way to model sentiment dynamics. there is yet another line of research in text mining, which tries to model the mixture of topics (themes) in documents [9, 1, 16, ***, 17, 12]<2>. the mixture model we presented is along this line influence:2 type:2 pair index:1047 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:419 citee title:discovering evolutionary theme patterns from text: an exploration of temporal text mining citee abstract:temporal text mining (ttm) is concerned with discovering temporal patterns in text information collected over time. since most text information bears some time stamps, ttm has many applications in multiple domains, such as summarizing events in news articles and revealing research trends in scientific literature. in this paper, we study a particular ttm task -- discovering and summarizing the evolutionary patterns of themes in a text stream. we define this new text mining problem and present general probabilistic methods for solving this problem through (1) discovering latent themes from text; (2) constructing an evolution graph of themes; and (3) analyzing life cycles of themes.
evaluation of the proposed methods on two different domains (i.e., news articles and literature) shows that the proposed methods can discover interesting evolutionary theme patterns effectively. surrounding text:we assume that c covers a number of topics, or subtopics (also known as themes) and some related sentiments. following [9, 1, ***, 17]<1>, we further assume that there are k major topics (subtopics) in the documents, {θ1, θ2, ... for this purpose, we introduce two additional concepts, topic life cycle and sentiment dynamics as follows. definition 4 (topic life cycle) a topic life cycle, also known as a theme life cycle in [***]<1>, is a time series representing the strength distribution of the neutral contents of a topic over the time line. the strength can be measured based on either the amount of text which a topic can explain [***]<1> or the relative strength of topics in a time period [15, 17]<1>. in this paper, we follow [***] and model the topic life cycles with the amount of document content that is generated with each topic model in different time periods. , nano, price, mini in the documents about ipod). the common english words are captured with a background component model [28, ***, 15]<1>, and the topical words are captured with topic models. in our topic-sentiment model, we extend the categories for the topical words in existing approaches. ... $(1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\big(\delta_{j,d,F}\,p(w|\theta_j)+\delta_{j,d,P}\,p(w|\theta_P)+\delta_{j,d,N}\,p(w|\theta_N)\big)\big]$ where $c(w,d)$ is the count of word $w$ in document $d$, $\lambda_B$ is the probability of choosing the background model $B$, $\pi_{d,j}$ is the probability of choosing the $j$-th topic in document $d$, and $\{\delta_{j,d,F},\delta_{j,d,P},\delta_{j,d,N}\}$ is the sentiment coverage of topic $j$ in document $d$, as defined in section 2. similar to existing work [28, ***, 15, 17]<1>, we also regularize this model by fixing some parameters. $\lambda_B$ is set to an empirical constant between 0 and 1, which indicates how much noise we believe exists in the weblog collection. defining a good partition of the time line is also a challenging problem, since too coarse a partition would miss many bursting patterns, while too fine a granularity may not be estimated reliably because of data sparseness. in this work, we present another approach to extract topic life cycles and sentiment dynamics, which is similar to the method used in [***]<1>. specifically, we use a hidden markov model (hmm) to tag every word in the collection with a topic and sentiment polarity. we first sort the documents with their time stamps, and convert the whole collection into a long sequence of words. on the surface, it appears that we could follow [***]<1> and construct an hmm with each state corresponding to a topic model (including the background model), and set the output probability of state $j$ to $p(w|\theta_j)$. a topic state can either stay on itself or transit to some other topic states through the background state. once all the parameters are estimated, we use the viterbi algorithm to decode the collection sequence.
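as a small, hedged illustration of the mixture term reconstructed above (background model plus topic and sentiment models weighted by the coverage parameters), the snippet below evaluates the log-likelihood of one document under such a mixture; all parameter values are toy assumptions, not estimates from the cited work.

```python
# Sketch: log-likelihood of one document under the background/topic/sentiment
# mixture, following the reconstructed formula above. Toy parameters only.
import math

def doc_log_likelihood(counts, p_b, topics, p_pos, p_neg, lambda_b, pi, delta):
    """counts: {word: count}; topics: list of {word: prob};
    pi: topic coverage pi_{d,j}; delta: (neutral, positive, negative) per topic."""
    ll = 0.0
    for w, c in counts.items():
        mix = lambda_b * p_b.get(w, 1e-12)
        for j, theta_j in enumerate(topics):
            f, p, n = delta[j]
            mix += (1.0 - lambda_b) * pi[j] * (
                f * theta_j.get(w, 1e-12)
                + p * p_pos.get(w, 1e-12)
                + n * p_neg.get(w, 1e-12))
        ll += c * math.log(mix)
    return ll

# toy parameters (assumptions, for illustration only)
p_b = {"the": 0.6, "and": 0.4}
topics = [{"battery": 0.7, "screen": 0.3}, {"price": 1.0}]
p_pos, p_neg = {"love": 1.0}, {"hate": 1.0}
doc = {"the": 2, "battery": 3, "love": 1}
print(doc_log_likelihood(doc, p_b, topics, p_pos, p_neg, lambda_b=0.3,
                         pi=[0.8, 0.2], delta=[(0.6, 0.3, 0.1), (0.7, 0.2, 0.1)]))
```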
finally, as in [***]<1>, we compute the topic life cycles and sentiment dynamics by counting the number of words labeled with the corresponding state over time. such data do not need to have sentiment labels, but should have time stamps, and be able to represent users' ad hoc information needs. following [***]<1>, we construct these data sets by submitting time-bounded queries to google blog search and collecting the blog entries returned. we restrict the search domain to spaces influence:1 type:1 pair index:1048 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:139 citee title:a mixture model for contextual text mining citee abstract:contextual text mining is concerned with extracting topical themes from a text collection with context information (e.g., time and location) and comparing/analyzing the variations of themes over different contexts. since the topics covered in a document are usually related to the context of the document, analyzing topical themes within context can potentially reveal many interesting theme patterns. in this paper, we generalize some of these models proposed in the previous work and we propose a new general probabilistic model for contextual text mining that can cover several existing models as special cases. specifically, we extend the probabilistic latent semantic analysis (plsa) model by introducing context variables to model the context of a document. the proposed mixture model, called contextual probabilistic latent semantic analysis (cplsa) model, can be applied to many interesting mining tasks, such as temporal text mining, spatiotemporal text mining, author-topic analysis, and cross-collection comparative analysis. empirical experiments show that the proposed mixture model can discover themes and their contextual variations effectively surrounding text:we assume that c covers a number of topics, or subtopics (also known as themes) and some related sentiments. following [9, 1, 16, ***]<1>, we further assume that there are k major topics (subtopics) in the documents, {θ1, θ2, ... definition 4 (topic life cycle) a topic life cycle, also known as a theme life cycle in [16]<1>, is a time series representing the strength distribution of the neutral contents of a topic over the time line. the strength can be measured based on either the amount of text which a topic can explain [16]<1> or the relative strength of topics in a time period [15, ***]<1>.
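the counting step described above (per-time-period counts of words labeled with each topic or sentiment state) can be illustrated with a compact sketch; the labeled-stream input format below is an assumption made purely for illustration.

```python
# Sketch: turn a stream of (date, state_label) word tags into per-week counts,
# i.e. rough topic life cycle / sentiment dynamics curves.
# The input format is an assumption made for illustration.
from collections import Counter, defaultdict
from datetime import date

def life_cycles(labeled_words):
    """labeled_words: iterable of (date, label) pairs, label in
    {'B', 'T1', ..., 'Tk', 'POS', 'NEG'}; returns {label: {week: count}}."""
    curves = defaultdict(Counter)
    for day, label in labeled_words:
        week = day.isocalendar()[:2]          # (year, ISO week number)
        curves[label][week] += 1
    return curves

stream = [(date(2005, 11, 1), "T1"), (date(2005, 11, 2), "POS"),
          (date(2005, 11, 9), "T1"), (date(2005, 11, 9), "NEG")]
print(life_cycles(stream)["T1"])
```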
in this paper, we follow [16] and model the topic life cycles with the amount of document content that is generated with each topic model in different time periods. ... $(1-\lambda_B)\sum_{j=1}^{k}\pi_{d,j}\big(\delta_{j,d,F}\,p(w|\theta_j)+\delta_{j,d,P}\,p(w|\theta_P)+\delta_{j,d,N}\,p(w|\theta_N)\big)\big]$ where $c(w,d)$ is the count of word $w$ in document $d$, $\lambda_B$ is the probability of choosing the background model $B$, $\pi_{d,j}$ is the probability of choosing the $j$-th topic in document $d$, and $\{\delta_{j,d,F},\delta_{j,d,P},\delta_{j,d,N}\}$ is the sentiment coverage of topic $j$ in document $d$, as defined in section 2. similar to existing work [28, 16, 15, ***]<1>, we also regularize this model by fixing some parameters. $\lambda_B$ is set to an empirical constant between 0 and 1, which indicates how much noise we believe exists in the weblog collection influence:1 type:1 pair index:1049 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:754 citee title:moodviews: tools for blog mood analysis citee abstract:we demonstrate a system for tracking and analyzing moods of bloggers worldwide, as reflected in the largest blogging community, livejournal. our system collects thousands of blog posts every hour, performs various analyses on the posts and presents the results graphically. surrounding text:for example, opinmind [20]<2> is a commercial weblog search engine which can categorize the search results into positive and negative opinions. mishne and others analyze the sentiments [***]<2> and moods [19]<2> in weblogs, and use the temporal patterns of sentiments to predict the book sales as opposed to simple blog mentions. however, a common deficiency of all this work is that the proposed approaches extract only the overall sentiment of a query or a blog article, but can neither distinguish different subtopics within a blog article, nor analyze the sentiment of a subtopic. however, all this work ignores the correlation between topics and sentiments. this limitation is shared with other sentiment analysis work such as [***]<2>. sentiment classification has been a challenging topic in natural language processing (see e influence:2 type:2 pair index:1050 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously.
the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:817 citee title:predicting movie sales from blogger sentiment citee abstract:the volume of discussion about a product in weblogs has recently been shown to correlate with the product's financial performance. in this paper, we study whether applying sentiment analysis methods to weblog data results in better correlation than volume only, in the domain of movies. our main finding is that positive sentiment is indeed a better predictor for movie success when applied to a limited context around references to the movie in weblogs, posted prior to its release. surrounding text:for example, opinmind [20]<2> is a commercial weblog search engine which can categorize the search results into positive and negative opinions. mishne and others analyze the sentiments [18]<2> and moods [***]<2> in weblogs, and use the temporal patterns of sentiments to predict the book sales as opposed to simple blog mentions. however, a common deficiency of all this work is that the proposed approaches extract only the overall sentiment of a query or a blog article, but can neither distinguish different subtopics within a blog article, nor analyze the sentiment of a subtopic. (i.e., topic life cycles and corresponding sentiment dynamics) could potentially provide more in-depth understanding of the public opinions than [20]<2>, and yield more accurate predictions of user behavior than using the methods proposed in [6]<2> and [***]<2>. to achieve this goal, we can approximate these temporal patterns by partitioning documents into their corresponding time periods and computing the posterior probability of $p(t|\theta_j)$, $p(t|\theta_j, P)$ and $p(t|\theta_j, N)$, where $t$ is a time period. however, there are several lines of related work. weblogs have been attracting increasing attention from researchers, who consider weblogs as a suitable test bed for many novel research problems and algorithms [11, 7, 6, 15, ***]<2>. much new research work has found applications to weblog analysis, such as community evolution [11]<2>, spatiotemporal text mining [15]<2>, opinion tracking [20, 15, ***]<2>, information propagation [7]<2>, and user behavior prediction [6]<2>.
mei and others introduced a mixture model to extract the subtopics in weblog collections, and track their distribution over time and locations [16]<2> influence:2 type:2,3 pair index:1051 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:174 citee title:a sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts citee abstract:sentiment analysis seeks to identify the viewpoint(s) underlying a text span; an example application is classifying a movie review as "thumbs up" or "thumbs down". to determine this sentiment polarity, we propose a novel machine-learning method that applies text-categorization techniques to just the subjective portions of the document. extracting these portions can be implemented using efficient techniques for finding minimum cuts in graphs; this greatly facilitates incorporation of cross-sentence contextual constraints surrounding text:, k}, each being characterized by a multinomial distribution over all the words in our vocabulary (also known as a unigram language model). following [23, ***, 13]<1>, we assume that there are two sentiment polarities in weblog articles, the positive and the negative sentiment. the two sentiments are associated with each topic in a document, representing the positive and negative opinions about the topic influence:2 type:2 pair index:1052 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. 
the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:117 citee title:seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales citee abstract:we address the rating-inference problem, wherein rather than simply decide whether a review is "thumbs up" or "thumbs down", as in previous sentiment analysis work, one must determine an author's evaluation with respect to a multi-point scale (e.g., one to five "stars"). this task represents an interesting twist on standard multi-class text categorization because there are several different degrees of similarity between class labels; for example, "three stars" is intuitively closer to "four stars" than to "one star". we first evaluate human performance at the task. then, we apply a meta-algorithm, based on a metric labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels. we show that the meta-algorithm can provide significant improvements over both multi-class and regression versions of svms when we employ a novel similarity measure appropriate to the problem. surrounding text:since traditional text categorization methods perform poorly on sentiment classification [23]<2>, pang and lee proposed a method using mincut algorithm to extract sentiments and subjective summarization for movie reviews [21]<2>. in some recent work, the definition of sentiment classification problem is generalized into a rating scale [***]<2>. the goal of this line of work is to improve the classification accuracy, while we aim at mining useful information (topic/sentiment models, sentiment dynamics) from weblogs influence:2 type:2 pair index:1053 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. 
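the surrounding context above mentions pang and lee's min-cut formulation for separating subjective from objective sentences. below is only a toy illustration of that graph formulation using networkx; the per-sentence scores and association weights are invented, and this is not the cited system or its data.

```python
# Toy min-cut sketch in the spirit of Pang & Lee's subjectivity extraction:
# sentences sit between a "subjective" source and an "objective" sink, with
# individual scores on source/sink edges and association scores between
# adjacent sentences. All weights below are invented for illustration.
import networkx as nx

subj_score = {"s1": 0.9, "s2": 0.6, "s3": 0.2}      # p(subjective) per sentence
assoc = {("s1", "s2"): 0.5, ("s2", "s3"): 0.5}      # proximity-based association

G = nx.DiGraph()
for s, p in subj_score.items():
    G.add_edge("SRC", s, capacity=p)                 # cost of calling s objective
    G.add_edge(s, "SNK", capacity=1.0 - p)           # cost of calling s subjective
for (a, b), w in assoc.items():                      # penalty for splitting neighbors
    G.add_edge(a, b, capacity=w)
    G.add_edge(b, a, capacity=w)

cut_value, (subjective_side, objective_side) = nx.minimum_cut(G, "SRC", "SNK")
print(sorted(subjective_side - {"SRC"}))             # sentences kept as subjective
```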
however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:, k}, each being characterized by a multinomial distribution over all the words in our vocabulary (also known as a unigram language model). following [***, 21, 13]<1>, we assume that there are two sentiment polarities in weblog articles, the positive and the negative sentiment. the two sentiments are associated with each topic in a document, representing the positive and negative opinions about the topic. 3 defining model priors the prior distribution should tell the tsm what the sentiment models should look like in the working collection. this knowledge may be obtained from domain specific lexicons, or training data in this domain as in [***]<1>. however, it is impossible to have such knowledge or training data for every ad hoc topics, or queries influence:2 type:2 pair index:1054 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:189 citee title:a tutorial on hidden markov models and selected applications in speech recognition citee abstract:this tutorial provides an overview of the basic theory of hidden markov models (hmms) as originated by l.e. baum and t. petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. the author first reviews the theory of discrete markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. the theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. three fundamental problems of hmms are noted and several practical techniques for solving these problems are given. the various types of hmms that have been studied, including ergodic as well as left-right models, are described surrounding text:a topic state can either stay on itself or transit to some other topic states through the background state. 
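the cited work trains standard classifiers (naive bayes, maximum entropy, support vector machines) on labeled reviews for polarity classification. a minimal scikit-learn sketch of that general setup, with a tiny invented training set, is shown below; it is not the cited experimental configuration.

```python
# Minimal polarity-classification sketch: bag-of-words features plus a
# multinomial naive Bayes classifier; the four training sentences are toys.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great battery and lovely screen", "i love this player",
               "terrible battery, broke in a week", "awful support, very disappointed"]
train_labels = ["pos", "pos", "neg", "neg"]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
clf.fit(train_texts, train_labels)
print(clf.predict(["the screen is lovely but the battery is terrible"]))
```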
the system can learn (from our collection) the transition probabilities with the baum-welch algorithm [***]<1> and decode the collection sequence with the viterbi algorithm [***]<1>. we can easily model sentiments by adding two sentiment states to the hmm. a topic state can either stay on itself or transit to some other topic states through the background state. the system can learn (from our collection) the transition probabilities with the baum-welch algorithm [***]<1> and decode the collection sequence with the viterbi algorithm [***]<1>. we can easily model sentiments by adding two sentiment states to the hmm influence:2 type:1 pair index:1055 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:834 citee title:regularized estimation of mixture models for robust pseudo-relevance feedback citee abstract:pseudo-relevance feedback has proven to be an effective strategy for improving retrieval accuracy in all retrieval models. however the performance of existing pseudo feedback methods is often affected significantly by some parameters, such as the number of feedback documents to use and the relative weight of original query terms; these parameters generally have to be set by trial-and-error without any guidance. in this paper, we present a more robust method for pseudo feedback based on statistical language models. our main idea is to integrate the original query with feedback documents in a single probabilistic mixture model and regularize the estimation of the language model parameters in the model so that the information in the feedback documents can be gradually added to the original query. unlike most existing feedback methods, our new method has no parameter to tune. experiment results on two representative data sets show that the new method is significantly more robust than a state-of-the-art baseline language modeling approach for feedback with comparable or better retrieval accuracy. surrounding text:2 to incorporate the pseudo counts given by the prior [14]<1>. 
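the surrounding text learns transition probabilities with baum-welch and decodes the word sequence with the viterbi algorithm over topic, sentiment, and background states. the block below is a generic, self-contained viterbi sketch; the tiny two-state hmm is an invented example, not the paper's hmm topology or parameters.

```python
# Generic Viterbi decoding over log-probabilities; the toy two-state HMM below
# is an invented example, not the cited topic/sentiment HMM.
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable state path for the observation sequence."""
    V = [{s: log_start[s] + log_emit[s].get(obs[0], -30.0) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] + log_trans[p][s])
            V[t][s] = (V[t - 1][best_prev] + log_trans[best_prev][s]
                       + log_emit[s].get(obs[t], -30.0))
            back[t][s] = best_prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

lg = math.log
states = ["TOPIC", "POS"]
log_start = {"TOPIC": lg(0.6), "POS": lg(0.4)}
log_trans = {"TOPIC": {"TOPIC": lg(0.7), "POS": lg(0.3)},
             "POS": {"TOPIC": lg(0.6), "POS": lg(0.4)}}
log_emit = {"TOPIC": {"battery": lg(0.8), "love": lg(0.2)},
            "POS": {"battery": lg(0.1), "love": lg(0.9)}}
print(viterbi(["battery", "love", "battery"], states, log_start, log_trans, log_emit))
```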
the new m-step updating formulas are: $p^{(n+1)}(w|\theta_P) = \frac{\mu_P\, p(w|\theta_P) + \sum_{d \in C}\sum_{j=1}^{k} c(w,d)\, p(z_{d,w,j,P}=1)}{\mu_P + \sum_{w' \in V}\sum_{d \in C}\sum_{j=1}^{k} c(w',d)\, p(z_{d,w',j,P}=1)}$, $p^{(n+1)}(w|\theta_N) = \frac{\mu_N\, p(w|\theta_N) + \sum_{d \in C}\sum_{j=1}^{k} c(w,d)\, p(z_{d,w,j,N}=1)}{\mu_N + \sum_{w' \in V}\sum_{d \in C}\sum_{j=1}^{k} c(w',d)\, p(z_{d,w',j,N}=1)}$, $p^{(n+1)}(w|\theta_j) = \frac{\mu_j\, p(w|\theta_j) + \sum_{d \in C} c(w,d)\, p(z_{d,w,j,F}=1)}{\mu_j + \sum_{w' \in V}\sum_{d \in C} c(w',d)\, p(z_{d,w',j,F}=1)}$. the parameters $\mu$'s can be either empirically set to constants, or set through regularized estimation [***]<1>, in which we would start with very large $\mu$'s and then gradually discount the $\mu$'s in each em iteration until some stopping condition is satisfied. we expect the topic models extracted to be unbiased towards sentiment polarities, so that they simply represent the neutral contents of the topics. in the experiments, we set the initial values of the $\mu$'s reasonably large (>10,000), and use the regularized estimation strategy in [***]<1> to gradually decay the $\mu$'s. $\lambda_B$ is empirically set between 0 influence:2 type:1 pair index:1056 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:236 citee title:annotating expressions of opinions and emotions in language citee abstract:this paper describes a corpus annotation project to study issues in the manual annotation of opinions, emotions, sentiments, speculations, evaluations and other private states in language. the resulting corpus annotation scheme is described, as well as examples of its use. in addition, the manual annotation process and the results of an inter-annotator agreement study on a 10,000-sentence corpus of articles drawn from the world press are presented. surrounding text:g., [***, 2]<2>). the most common definition of the problem is a binary classification task of a sentence to either the positive or the negative polarity [23, 21]<2> influence:2 type:2 pair index:1057 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics.
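a minimal sketch of the regularized m-step above: expected counts from the e-step are combined with $\mu$ pseudo counts contributed by a prior model, and $\mu$ is decayed between em iterations. the array shapes, prior values, and the halving decay schedule are illustrative assumptions, not the cited settings.

```python
# Sketch of one regularized M-step update with prior pseudo counts; mu is
# decayed between iterations. All values below are illustrative assumptions.
import numpy as np

def m_step_with_prior(expected_counts, prior_model, mu):
    """expected_counts, prior_model: arrays over the vocabulary.
    Returns the re-estimated multinomial p(w | model)."""
    numer = mu * prior_model + expected_counts
    return numer / numer.sum()

prior = np.array([0.4, 0.4, 0.1, 0.05, 0.05])        # e.g. a sentiment prior
mu = 10000.0
for it in range(3):                                    # a few EM iterations
    expected = np.array([5.0, 1.0, 20.0, 3.0, 1.0])   # would come from the E-step
    p_w = m_step_with_prior(expected, prior, mu)
    mu *= 0.5                                          # gradual decay (assumed schedule)
print(np.round(p_w, 3))
```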
empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:847 citee title:sentiment analyzer: extracting sentiments about a given topic using natural language processing techniques citee abstract:we present sentiment analyzer (sa) that extracts sentiment (or opinion) about a subject from online text documents. instead of classifying the sentiment of an entire document about a subject, sa detects all references to the given subject, and determines sentiment in each of the references using natural language processing (nlp) techniques. our sentiment analysis consists of 1) a topic specific feature term extraction, 2) sentiment extraction, and 3) (subject, sentiment) association by relationship analysis. sa utilizes two linguistic resources for the analysis: the sentiment lexicon and the sentiment pattern database. the performance of the algorithms was verified on online product review articles ("digital camera" and "music" reviews), and more general documents including general webpages and news articles. surrounding text:in a very recent work [4]<2>, the author proposed a topic dependent method for sentiment retrieval, which assumed that a sentence was generated from a probabilistic model consisting of both a topic language model and a sentiment language model. a similar approach could be found in [***]<2>. their vision of topic-sentiment dependency is similar to ours. however, those product opinion features are highly dependent on the training data sets, thus are not flexible to deal with ad hoc queries and topics. the same problem is shared with [***]<2>. they also did not provide a way to model sentiment dynamics influence:1 type:2 pair index:1058 citer id:863 citer title:topic sentiment mixture citer abstract:in this paper, we define the problem of topic-sentiment analysis on weblogs and propose a novel probabilistic model to capture the mixture of topics and sentiments simultaneously. the proposed topic-sentiment mixture (tsm) model can reveal the latent topical facets in a weblog collection, the subtopics in the results of an ad hoc query, and their associated sentiments. it could also provide general sentiment models that are applicable to any ad hoc topics. with a specifically designed hmm structure, the sentiment models and topic models estimated with tsm can be utilized to extract topic life cycles and sentiment dynamics. empirical experiments on different weblog datasets show that this approach is effective for modeling the topic facets and sentiments and extracting their dynamics from weblog collections. the tsm model is quite general; it can be applied to any text collections with a mixture of topics and sentiments, thus has many potential applications, such as search result summarization, opinion tracking, and user behavior prediction citee id:29 citee title:a cross-collection mixture model for comparative text mining citee abstract:problem, which we refer to as comparative text mining (ctm). 
given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections as well as summarize the similarity and differences of these collections along each common theme. this general problem subsumes many interesting applications, including business intelligence and opinion summarization. we propose a generative probabilistic mixture model for comparative text surrounding text:for example, a user may like the price and fuel efficiency of a new toyota camry, but dislike its power and safety aspects. indeed, people tend to have different opinions about different features of a product [***, 13]<2>. as another example, a voter may agree with some points made by a presidential candidate, but disagree with some others. they also require sentiment training data for every topic, or manually input sentiment keywords, while we can learn general sentiment models applicable to ad hoc topics. most opinion extraction work tries to find general opinions on a given topic but does not distinguish sentiments [***, 15]<2>. liu and others extracted product features and opinion features for a product, thus were able to provide sentiments for different features of a product influence:2 type:2 pair index:1059 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6.
it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:629 citee title:mining the peanut gallery: opinion extraction and semantic classification of product reviews citee abstract:the web contains a wealth of product reviews, but sifting through them is a daunting task. ideally, an opinion mining tool would process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good). we begin by identifying the unique properties of this problem and develop a method for automatically distinguishing between positive and negative reviews. our classifier draws on information retrieval techniques for feature extraction and scoring, and the results for various metrics and heuristics vary depending on the testing situation. the best methods work as well as or better than traditional machine learning. when operating on individual sentences collected from web searches, performance is limited due to noise and ambiguity. but in the context of a complete web-based tool and aided by a simple method for grouping sentences into attributes, the results are qualitatively quite useful. surrounding text:people are concerned about opinions, and this makes the techniques of opinion information processing practical$ generally speaking, opinions are divided into three categories: positive, neutral and negative. opinions of different polarities in documents are useful references or feedbacks for governments or companies helping them improve their services or products [***]<3>. opinions are usually about a theme, and are viewed after grouping by the target which opinions toward to, the opinion holders or the opinion polarities influence:2 type:3,2 pair index:1061 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. 
the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:403 citee title:mining and summarizing customer reviews citee abstract:merchants selling products on the web often ask their customers to review the products that they have purchased and the associated services. as e-commerce is becoming more and more popular, the number of customer reviews that a product receives grows rapidly. for a popular product, the number of reviews can be in hundreds or even thousands. this makes it difficult for a potential customer to read them to make an informed decision on whether to purchase the product. it also makes it difficult for the manufacturer of the product to keep track and to manage customer opinions. for the manufacturer, there are additional difficulties because many merchant sites may sell the same product and the manufacturer normally produces many kinds of products. in this research, we aim to mine and to summarize all the customer reviews of a product. this summarization task is different from traditional text summarization because we only mine the features of the product on which the customers have expressed their opinions and whether the opinions are positive or negative. we do not summarize the reviews by selecting a subset or rewrite some of the original sentences from the reviews to capture the main points as in the classic text summarization. our task is performed in three steps: (1) mining product features that have been commented on by customers; (2) identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative; (3) summarizing the results. this paper proposes several novel techniques to perform these tasks. our experimental results using reviews of a number of products sold online demonstrate the effectiveness of the techniques surrounding text:researches of extracting opinions in documents of a specific genre, reviews, also use one document as their judging unit. daves and hus researches both focused on extracting opinions of 3c product reviews [2][***]<2>, while bai, padman and airoldi [11]<2> use movie reviews as experimental materials. as for sentences, they are the basic unit for a person to express a complete idea influence:2 type:3,2 pair index:1062 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. 
the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:111 citee title:determining the sentiment of opinions citee abstract:identifying sentiments (the affective parts of opinions) is a challenging problem. we present a system that, given a topic, automatically finds the people who hold opinions about that topic and the sentiment of each opinion. the system contains a module for determining word sentiment and another for combining sentiments within a sentence. we experiment with various models of classifying and combining sentiment at word and sentence levels, with promising results. surrounding text:as for sentences, they are the basic unit for a person to express a complete idea. riloff and wiebe distinguished subjective sentences [7]<2>, while kim and hovy proposed a sentiment classifier for english words and sentences [***]<2>. of course, the composed opinion words must be recognized first to process opinion documents and sentences influence:1 type:2 pair index:1063 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:703 citee title:major topic detection and its application to opinion summarization citee abstract:watching specific information sources and summarizing the newly discovered opinions is important for governments to improve their services and companies to improve their products . because no queries are posed beforehand, detecting opinions is similar to the task of topic detection on sentence level. besides telling which opinions are positive or negative, identifying which events correlated with such opinions are also important. this paper proposes a major topic detection mechanism to capture main concepts embedded implicitly in a relevant document set. opinion summarization further retrieves all the relevant sentences related to the major topic from the document set, determines the opinion polarity of each relevant sentence, and finally summarizes positive and negative sentences, respectively. surrounding text:since all documents are relevant to the selected opinion topics, sentences in these testing documents are all treated as relevant to achieve the baseline performance and we focus on the opinion related tasks this year. in the future, for selecting topical words and further retrieving relevant sentences, the existing algorithm, which works well on trec materials [***]<3>, can be applied to improve the performance. 
we have developed a chinese opinion extraction system influence:2 type:2 pair index:1064 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:118 citee title:thumbs up? sentiment classification using machine learning techniques citee abstract:we consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. however, the three machine learning methods we employed (naive bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. we conclude by examining factors that make the sentiment classification problem more challenging. surrounding text:generally speaking, the unit for opinion information can be one document, one sentence, or a single word. wiebe, wilson and bell [9]<2> and pang, lee, and vaithyanathan [***]<2> processed opinion documents and their sentiment or opinion polarities. researches of extracting opinions in documents of a specific genre, reviews, also use one document as their judging unit. many techniques of nlp were also studied for opinion information processing. machine learning approaches such as naive bayes, maximum entropy classification, and support vector machines have been investigated [***]<2>. both information retrieval [2]<2> and information extraction [1]<2> technologies have also been explored, however, various metrics and testing beds are employed, which leads to incomparable results influence:2 type:2 pair index:1065 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. 
the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:120 citee title:learning extraction patterns for subjective expressions citee abstract:this paper presents a bootstrapping process that learns linguistically rich extraction patterns for subjective (opinionated) expressions. high-precision classifiers label unannotated data to automatically create a large training set, which is then given to an extraction pattern learning algorithm. the learned patterns are then used to identify more subjective sentences. the bootstrapping process learns many subjective patterns and increases recall while maintaining high precision. surrounding text:as for sentences, they are the basic unit for a person to express a complete idea. riloff and wiebe distinguished subjective sentences [***]<2>, while kim and hovy proposed a sentiment classifier for english words and sentences [4]<2>. of course, the composed opinion words must be recognized first to process opinion documents and sentences influence:2 type:2 pair index:1066 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:547 citee title:extracting semantic orientations of words using spin model citee abstract:we propose a method for extracting semantic orientations of words: desirable or undesirable. regarding semantic orientations as spins of electrons, we use the mean field approximation to compute the approximate probability function of the system instead of the intractable actual probability function. we also propose a criterion for parameter selection on the basis of magnetization. given only a small number of seed words, the proposed method extracts semantic orientations with high accuracy in the experiments on english lexicon. the result is comparable to the best value ever reported surrounding text:of course, the composed opinion words must be recognized first to process opinion documents and sentences. riloff, wiebe and wilson [12]<2> learned opinion nouns from patterns, and takamura, inui and okumura [***]<2> adopt a physical model to decide opinion polarities of words. many techniques of nlp were also studied for opinion information processing influence:2 type:2 pair index:1067 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. 
it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:624 citee title:identify collocations for recognizing opinions citee abstract:subjectivity in natural language refers to aspects of language used to express opinions and evaluations surrounding text:generally speaking, the unit for opinion information can be one document, one sentence, or a single word. wiebe, wilson and bell [***]<2> and pang, lee, and vaithyanathan [6]<2> processed opinion documents and their sentiment or opinion polarities. researches of extracting opinions in documents of a specific genre, reviews, also use one document as their judging unit influence:2 type:2 pair index:1068 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:814 citee title:overview of the trec 2003 novelty track citee abstract:the novelty track was first introduced in trec 2002. given a trec topic and an ordered list of documents, systems must find the relevant and novel sentences that should be returned to the user from this set. this task integrates aspects of passage retrieval and information filtering. this year, rather than using old trec topics and documents, we developed fifty new topics specifically for the novelty track. these topics were of two classes: "events" and "opinions". additionally, the documents surrounding text:one of the three major conferences, trec, tried to survey these techniques by having the novelty track. [***]<3> however, this task is proven to be tough because of the lack of information in only one sentence. moreover, extracting opinion holders is beyond extracting named entities influence:2 type:3 pair index:1069 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. 
it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:774 citee title:on learning parsimonious models for extracting consumer opinions citee abstract:extracting sentiments from unstructured text has emerged as an important problem in many disciplines. an accurate method would enable us, for example, to mine online opinions from the internet and learn customers preferences for economic or marketing research, or for leveraging a strategic advantage. in this paper, we propose a two-stage bayesian algorithm that is able to capture the dependencies among words, and, at the same time, finds a vocabulary that is efficient for the purpose of extracting sentiments. experimental results on the movie reviews data set show that our algorithm is able to select a parsimonious feature set with substantially fewer predictor variables than in the full data set and leads to better predictions about sentiment orientations than several state-of-the-art machine learning methods. our findings suggest that sentiments are captured by conditional dependence relations among words, rather than by keywords or high-frequency words. surrounding text:researches of extracting opinions in documents of a specific genre, reviews, also use one document as their judging unit. daves and hus researches both focused on extracting opinions of 3c product reviews [2][3]<2>, while bai, padman and airoldi [***]<2> use movie reviews as experimental materials. as for sentences, they are the basic unit for a person to express a complete idea influence:2 type:2 pair index:1070 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. 
the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:696 citee title:learning subjective nouns using extraction pattern bootstrapping citee abstract:we explore the idea of creating a subjectivity classifier that uses lists of subjective nouns learned by bootstrapping algorithms. the goal of our research is to develop a system that can distinguish subjective sentences from objective sentences. first, we use two bootstrapping algorithms that exploit extraction patterns to learn sets of subjective nouns. then we train a naive bayes classifier using the subjective nouns, discourse features, and subjectivity clues identified in prior surrounding text:of course, the composed opinion words must be recognized first to process opinion documents and sentences. riloff, wiebe and wilson [***]<2> learned opinion nouns from patterns, and takamura, inui and okumura [8]<2> adopt a physical model to decide opinion polarities of words. many techniques of nlp were also studied for opinion information processing influence:2 type:2 pair index:1071 citer id:866 citer title:using polarity scores of words for sentence-level opinion extraction citer abstract:the opinion analysis task is a pilot study task in ntcir-6. it contains the challenges of opinion sentence extraction, opinion polarity judgment, opinion holder extraction and relevance sentence extraction. the three former are new tasks, and the latter is proven to be tough in trec. in this paper, we introduce our system for analyzing opinionated information. several formulae are proposed to decide the opinion polarities and strengths of words from composed characters and then further to process opinion sentences. the negation operators are also taken into consideration in opinion polarity judgment, and the opinion operators are used as clues to find the locations of opinion holders. the performance of the opinion extraction and polarity judgment achieves the f-measure 0.383 under the lenient metric and 0.180 under the strict metric, which is the second best of all participants citee id:113 citee title:opinion extraction, summarization and tracking in news and blog corpora citee abstract:humans like to express their opinions and are eager to know others opinions. automatically mining and organizing opinions from heterogeneous information sources are very useful for individuals, organizations and even governments. opinion extraction, opinion summarization and opinion tracking are three important techniques for understanding opinions. opinion extraction mines opinions at word, sentence and document levels from articles. opinion summarization summarizes opinions of articles by telling sentiment polarities, degree and the correlated events. in this paper, both news and web blog articles are investigated. trec, ntcir and articles collected from web blogs serve as the information sources for opinion extraction. documents related to the issue of animal cloning are selected as the experimental materials. algorithms for opinion extraction at word, sentence and document level are proposed. the issue of relevant sentence selection is discussed, and then topical and opinionated information are summarized. opinion summarizations are visualized by representative sentences. text-based summaries in different languages, and from different sources, are compared. 
finally, an opinionated curve showing supportive and nonsupportive degree along the timeline is illustrated by an opinion tracking system. surrounding text:intuitively, a chinese sentiment dictionary is indispensable. we adopt a chinese opinion dictionary ntusd [***]<1>. ntusd consists of 2,812 positive and 8,276 negative opinion words influence:2 type:3,2 pair index:0 citer id:26 citer title:clustering methods for collaborative filtering citer abstract:grouping people into clusters based on the items they have purchased allows accurate recommendations of new items for purchase: if you and i have liked many of the same movies, then i will probably enjoy other movies that you like. recommending items based on similarity of interest (a.k.a. collaborative filtering) is attractive for many domains: books, cds, movies, etc., but does not always work well. because data are always sparse { any given person has seen only a small fraction of all movies { much more accurate predictions can be made by grouping people into clusters with similar movies and grouping movies into clusters which tend to be liked by the same people. finding optimal clusters is tricky because the movie groups should be used to help determine the people groups and visa versa. we present a formal statistical model of collaborative filtering, and compare di erent algorithms for estimating the model parameters including variations of k-means clustering and gibbs sampling. this formal model is easily extended to handle clustering of objects with multiple attributes citee id:27 citee title:using collaborative filtering to weave an information tapestry citee abstract:the tapestry experimental mail system developed at the xerox palo alto research center is predicated on the belief that information filtering can be more effective when humans are involved in the filtering process. tapestry was designed to support both content-based filtering and collaborative filtering, which entails people collaborating to help each other perform filtering by recording their reactions to documents they read. the reactions are called annotations; they can be accessed by other peoples filters. tapestry is intended to handle any incoming stream of electronic documents and serves both as a mail filter and repository; its components are the indexer, document store, annotation store, filterer, little box, remailer, appraiser and reader/browser. tapestrys client/server architecture, its various components, and the tapestry query language are described. surrounding text:collaborative filtering methods have been applied to many applications both in research (goldberg et al$, 1992, sheth and maes, 1993; maes and shardanand, 1995; konstan et al. , 1997)[***,6,7,11] and in industry (see http://www. sims influence:1 type:2 pair index:1 citer id:26 citer title:clustering methods for collaborative filtering citer abstract:grouping people into clusters based on the items they have purchased allows accurate recommendations of new items for purchase: if you and i have liked many of the same movies, then i will probably enjoy other movies that you like. recommending items based on similarity of interest (a.k.a. collaborative filtering) is attractive for many domains: books, cds, movies, etc., but does not always work well. 
because data are always sparse { any given person has seen only a small fraction of all movies { much more accurate predictions can be made by grouping people into clusters with similar movies and grouping movies into clusters which tend to be liked by the same people. finding optimal clusters is tricky because the movie groups should be used to help determine the people groups and visa versa. we present a formal statistical model of collaborative filtering, and compare di erent algorithms for estimating the model parameters including variations of k-means clustering and gibbs sampling. this formal model is easily extended to handle clustering of objects with multiple attributes citee id:28 citee title:grouplens: applying collaborative filtering to usenet news citee abstract:the grouplens project designed, implemented, and evaluated a collaborative filtering system for usenet newsa high-volume, high-turnover discussion list service on the internet. usenet newsgroupsthe individual discussion listsmay carry hundreds of messages each day. while in theory the newsgroup organization allows readers to select the content that most interests them, in practice surrounding text:collaborative filtering methods have been applied to many applications both in research (goldberg et al$, 1992, sheth and maes, 1993; maes and shardanand, 1995; konstan et al. , 1997)[4,***,7,11] and in industry (see http://www. sims influence:1 type:2 pair index:2 citer id:32 citer title:collaborative filtering via gaussian probabilistic latent semantic analysis citer abstract:collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, i.e. a database of available user preferences. in this paper, we describe a new model-based algorithm designed for this task, which is based on a generalization of probabilistic latent semantic analysis to continuous-valued response variables. more specifically, we assume that the observed user ratings can be modeled as a mixture of user communities or interest groups, where users may participate probabilistically in one or more groups. each community is characterized by a gaussian distribution on the normalized ratings for each item. the normalization of ratings is performed in a user-specific manner to account for variations in absolute shift and variance of ratings. experiments on the eachmovie data set show that the proposed approach compares favorably with other collaborative filtering techniques citee id:27 citee title:using collaborative filtering to weave an information tapestry citee abstract:the tapestry experimental mail system developed at the xerox palo alto research center is predicated on the belief that information filtering can be more effective when humans are involved in the filtering process. tapestry was designed to support both content-based filtering and collaborative filtering, which entails people collaborating to help each other perform filtering by recording their reactions to documents they read. the reactions are called annotations; they can be accessed by other peoples filters. tapestry is intended to handle any incoming stream of electronic documents and serves both as a mail filter and repository; its components are the indexer, document store, annotation store, filterer, little box, remailer, appraiser and reader/browser. tapestrys client/server architecture, its various components, and the tapestry query language are described. 
surrounding text:then predictions and recommendations are computed based on the preferences and judgments of these similar or likeminded users. recommender systems using memory-based technology include [***], the grouplens (and movielens) project [2, 3], ringo [4] as well as a number of commercial systems, most notably the systems deployed at amazon. com and cdnow - 7. references [***] d. goldberg, d influence:1 type:2 pair index:3 citer id:32 citer title:collaborative filtering via gaussian probabilistic latent semantic analysis citer abstract:collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, i.e. a database of available user preferences. in this paper, we describe a new model-based algorithm designed for this task, which is based on a generalization of probabilistic latent semantic analysis to continuous-valued response variables. more specifically, we assume that the observed user ratings can be modeled as a mixture of user communities or interest groups, where users may participate probabilistically in one or more groups. each community is characterized by a gaussian distribution on the normalized ratings for each item. the normalization of ratings is performed in a user-specific manner to account for variations in absolute shift and variance of ratings. experiments on the eachmovie data set show that the proposed approach compares favorably with other collaborative filtering techniques citee id:33 citee title:grouplens: an open architecture for collaborative filtering of netnews citee abstract:collaborative filters help people make choices based on the opinions of other people. grouplens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. news reader clients display predicted scores and make it easy for users to rate articles after they read them. rating servers, called better bit bureaus, gather and disseminate the ratings. the rating servers predict scores based on the heuristic that people who surrounding text:then predictions and recommendations are computed based on the preferences and judgments of these similar or likeminded users. recommender systems using memory-based technology include [1], the grouplens (and movielens) project [***, 3], ringo [4] as well as a number of commercial systems, most notably the systems deployed at amazon. com and cdnow influence:1 type:2 pair index:4 citer id:32 citer title:collaborative filtering via gaussian probabilistic latent semantic analysis citer abstract:collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, i.e. a database of available user preferences. in this paper, we describe a new model-based algorithm designed for this task, which is based on a generalization of probabilistic latent semantic analysis to continuous-valued response variables. more specifically, we assume that the observed user ratings can be modeled as a mixture of user communities or interest groups, where users may participate probabilistically in one or more groups. each community is characterized by a gaussian distribution on the normalized ratings for each item. the normalization of ratings is performed in a user-specific manner to account for variations in absolute shift and variance of ratings. 
experiments on the eachmovie data set show that the proposed approach compares favorably with other collaborative filtering techniques citee id:28 citee title:grouplens: applying collaborative filtering to usenet news citee abstract:the grouplens project designed, implemented, and evaluated a collaborative filtering system for usenet newsa high-volume, high-turnover discussion list service on the internet. usenet newsgroupsthe individual discussion listsmay carry hundreds of messages each day. while in theory the newsgroup organization allows readers to select the content that most interests them, in practice surrounding text:then predictions and recommendations are computed based on the preferences and judgments of these similar or likeminded users. recommender systems using memory-based technology include [1], the grouplens (and movielens) project [2, ***], ringo [4] as well as a number of commercial systems, most notably the systems deployed at amazon. com and cdnow influence:1 type:2 pair index:5 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user. unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users preferences on items separately from the users rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings. we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:35 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation metrics. the first characterizes accuracy over a set of individual predictions in terms of average absolute deviation. the second estimates the utility of a ranked list of suggested items. this metric uses an estimate of the probability that a user will see a recommendation in an ordered list. experiments were run for datasets associated with 3 application areas, 4 experimental protocols, and the 2 evaluation metrics for the various algorithms. results indicate that for a wide range of conditions, bayesian networks with decision trees at each node and correlation methods outperform bayesian-clustering and vector-similarity methods. 
between correlation and bayesian networks, the preferred method depends on the nature of the dataset, nature of the application (ranked versus one-by-one presentation), and the availability of votes with which to make predictions. other considerations include the size of database, speed of predictions, and learning time. surrounding text:in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue. many algorithms have been proposed to deal with the collaborative filtering problem [***, 2, 3, 4, 5, 8, 12]. according to [***], most collaborative filtering algorithms can be categorized into two classes: memory-based algorithms and model-based algorithms - many algorithms have been proposed to deal with the collaborative filtering problem [***, 2, 3, 4, 5, 8, 12]. according to [***], most collaborative filtering algorithms can be categorized into two classes: memory-based algorithms and model-based algorithms. the memory-based algorithms first find the users from the training database that are most similar to the current test user in terms of the rating pattern, and then combine the ratings given by those similar users to obtain the prediction for the test user - the memory-based algorithms first find the users from the training database that are most similar to the current test user in terms of the rating pattern, and then combine the ratings given by those similar users to obtain the prediction for the test user. the major approaches within this category are the pearson-correlation based approach [4], the vector similarity based approach [***], and the extended generalized vector space model [3]. model-based approaches group together different users in the training database into a small number of classes based on their rating patterns - in order to predict the rating of a test user on a particular item, we can simply categorize the test user into one of the predefined user classes and use the predicted class as the prediction for the test user. proposed algorithms within this category include a bayesian network approach [***], and the aspect model [5]. compared with the memory-based approaches, the model-based approaches only have to store the profiles of models and therefore are much more efficient than storing the whole user database - furthermore, model-based approaches tend to assume that a small number of user classes is sufficient for modeling the rating patterns of users, thus imply a loss of diversity among users, which may be very important in helping predict the ratings of a test user. indeed, according to the previous studies on the comparison of memory-based approaches and model-based approaches, memory-based approaches such as the correlation method perform very close to those complicated model-based methods such as a bayesian network approach and an aspect model [***]. because both memory-based and model-based models - special treatments are needed to deal with the missing ratings. for example, in the correlation approach, default ratings are introduced for those unrated items [***].
for some probabilistic models, an extra rating category named norating is introduced in the case when ratings are missing [***] - for example, in the correlation approach, default ratings are introduced for those unrated items [***]. for some probabilistic models, an extra rating category named norating is introduced in the case when ratings are missing [***]. the second issue arose out of the observation that users with similar preferences on items may still rate items differently - the second issue arose out of the observation that users with similar preferences on items may still rate items differently. indeed, in the previous study, the correlation approach usually outperforms the vector similarity approach significantly [***]. one major difference between these two approaches is that, for the correlation approach, the similarity (i.e., the pearson correlation coefficients) between two users is measured based on the relative ratings, namely the original ratings subtracted by the average rating of each user, while for the vector similarity approach, the original ratings are used for measuring the user similarity - one major difference between these two approaches is that, for the correlation approach, the similarity (i.e., the pearson correlation coefficients) between two users is measured based on the relative ratings, namely the original ratings subtracted by the average rating of each user, while for the vector similarity approach, the original ratings are used for measuring the user similarity. in the comparison of these two approaches in [***], the correlation method outperforms the vector similarity method significantly, particularly in terms of the mse metric. this fact indicates that, due to the variance in the rating behavior of different users, the rating information may not be able to indicate the preference information directly - 3. 1 baseline methods following [***,2,12], we compare our method to several existing methods, including the pearson correlation coefficient method, the vector similarity method and the personality diagnosis method. these methods are described below - these methods are described below. pearson correlation coefficient (pcc) according to [***], the pearson correlation coefficient method predicts the rating of a test user y0 on item x as $\hat{r}_{y_0}(x) = \bar{r}_{y_0} + \sum_{y} w_{y_0,y}\,\tilde{r}_y(x) \big/ \sum_{y}|w_{y_0,y}|$, where $\tilde{r}_y(x) = r_y(x) - \bar{r}_y$ and $w_{y_0,y}$ is the pearson correlation coefficient between users $y_0$ and $y$ - to compare the methods in full spectrum, we vary both the number of users in the training database and the number of observed items that are rated by the test user, as has been done in [***, 8, 12]. for the smaller database movierating, we allow the number of users in the training database to be 100 and 200, and the number of exposed items rated by the test users to be 5, 10 and 20 - there are some other measures like the receiver operating characteristic (roc) as a decision-support accuracy measure [7] and the normalized mae. but since mae is the most commonly used metric and has been reported in most previous research [***, 2, 7, 8, 12], we chose it as the evaluation measure in our experiments to make our results more comparable. 3 - 5. references [***] j. s influence:1 type:2 pair index:6 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user.
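As an illustration of the memory-based baseline and the MAE measure discussed above, the following is a small Python sketch of Pearson-correlation-based prediction; the dictionary-based data layout and function names are assumptions for the example, not code from the cited systems.

```python
import numpy as np

def pearson_weight(r_a, r_u):
    """Pearson correlation between two users over their co-rated items.
    r_a, r_u are dicts mapping item id -> rating."""
    common = set(r_a) & set(r_u)
    if len(common) < 2:
        return 0.0
    a = np.array([r_a[i] for i in common], dtype=float)
    u = np.array([r_u[i] for i in common], dtype=float)
    a -= a.mean()
    u -= u.mean()
    denom = np.sqrt((a * a).sum() * (u * u).sum())
    return float((a * u).sum() / denom) if denom > 0 else 0.0

def predict_pcc(test_user, item, train_users):
    """Predict the test user's rating on `item` as the user's mean plus a
    correlation-weighted sum of the mean-centered ratings of training users."""
    base = np.mean(list(test_user.values()))
    num, den = 0.0, 0.0
    for r_u in train_users:
        if item not in r_u:
            continue
        w = pearson_weight(test_user, r_u)
        num += w * (r_u[item] - np.mean(list(r_u.values())))
        den += abs(w)
    return base + num / den if den > 0 else base

def mae(predictions, truths):
    """Mean absolute error, the evaluation measure chosen in the excerpt."""
    return float(np.mean([abs(p - t) for p, t in zip(predictions, truths)]))
```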
unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users' preferences on items separately from the users' rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings. we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:29 citee title:collaborative filtering and the generalized vector space model citee abstract:collaborative filtering is a technique for recommending documents to users based on how similar their tastes are to other users. if two users tend to agree on what they like, the system will recommend the same documents to them. the generalized vector space model of information retrieval represents a document by a vector of its similarities to all other documents. the process of collaborative filtering is nearly identical to the process of retrieval using gvsm in a matrix of user ratings. using this observation, a model for filtering collaboratively using document content is possible. surrounding text:in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue. many algorithms have been proposed to deal with the collaborative filtering problem [1, 2, ***, 4, 5, 8, 12]. according to [1], most collaborative filtering algorithms can be categorized into two classes: memory-based algorithms and model-based algorithms - the memory-based algorithms first find the users from the training database that are most similar to the current test user in terms of the rating pattern, and then combine the ratings given by those similar users to obtain the prediction for the test user. the major approaches within this category are the pearson-correlation based approach [4], the vector similarity based approach [1], and the extended generalized vector space model [***]. model-based approaches group together different users in the training database into a small number of classes based on their rating patterns influence:1 type:2 pair index:7 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user.
we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:33 citee title:grouplens: an open architecture for collaborative filtering of netnews citee abstract:collaborative filters help people make choices based on the opinions of other people. grouplens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. news reader clients display predicted scores and make it easy for users to rate articles after they read them. rating servers, called better bit bureaus, gather and disseminate the ratings. the rating servers predict scores based on the heuristic that people who surrounding text:in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue. many algorithms have been proposed to deal with the collaborative filtering problem [1, 2, 3, ***, 5, 8, 12]. according to [1], most collaborative filtering algorithms can be categorized into two classes: memory-based algorithms and model-based algorithms - the memory-based algorithms first find the users from the training database that are most similar to the current test user in terms of the rating pattern, and then combine the ratings given by those similar users to obtain the prediction for the test user. the major approaches within this category are the pearson-correlation based approach [***], the vector similarity based approach [1], and the extended generalized vector space model [3]. model-based approaches group together different users in the training database into a small number of classes based on their rating patterns - $\hat{r}_{y_0}(x) = \bar{r}_{y_0} + \frac{\sum_{y} w_{y_0,y}\,\tilde{r}_y(x)}{\sum_{y}|w_{y_0,y}|}$, where $\tilde{r}_y(x) = r_y(x) - \bar{r}_y$ and $w_{y_0,y} = \frac{\sum_{x}\tilde{r}_{y_0}(x)\,\tilde{r}_y(x)}{\sqrt{\sum_{x}\tilde{r}_{y_0}(x)^2\,\sum_{x}\tilde{r}_y(x)^2}}$ (15) personality diagnosis (pd) this is a method introduced by pennock et al. in [***]. it treats each user in the training pool as an individual model influence:1 type:2 pair index:8 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user.
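The personality diagnosis method mentioned above treats every training user as a candidate model for the active user. A minimal Python sketch of that idea, assuming Gaussian rating noise with an illustrative sigma and a 1-5 rating scale (not the cited implementation), could look like this:

```python
import numpy as np

def pd_predict(observed, item, train_users, rating_scale=(1, 2, 3, 4, 5), sigma=1.0):
    """Personality-diagnosis style prediction (sketch).

    observed: dict item -> rating for the active user.
    train_users: list of dicts; each training user is one candidate personality.
    Returns the most probable rating value for `item`.
    """
    def noise(x, y):
        # likelihood of observing rating x when the "true" personality rating is y
        return np.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))

    # weight of each personality, given the active user's observed ratings
    weights = []
    for r_u in train_users:
        w = 1.0
        for it, val in observed.items():
            if it in r_u:
                w *= noise(val, r_u[it])
        weights.append(w)

    # distribution over rating values for the target item
    scores = {v: 0.0 for v in rating_scale}
    for w, r_u in zip(weights, train_users):
        if item in r_u:
            for v in rating_scale:
                scores[v] += w * noise(v, r_u[item])
    return max(scores, key=scores.get)   # falls back to the first scale value if no evidence
```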
experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:36 citee title:latent class models for collaborative filtering citee abstract:this paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior. two models are discussed and compared: the aspect model, a probabilistic latent space model which models individual preferences as a convex combination of preference factors, and the two-sided clustering model, which simultaneously partitions persons and objects into clusters. we present em algorithms for different variants of the aspect model and derive an approximate em algorithm based on a variational principle for the two-sided clustering model. the benefits of the different models are experimentally investigated on a large movie data set. surrounding text:in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue. many algorithms have been proposed to deal with the collaborative filtering problem [1, 2, 3, 4, ***, 8, 12]. according to [1], most collaborative filtering algorithms can be categorized into two classes: memory-based algorithms and model-based algorithms - in order to predict the rating of a test user on a particular item, we can simply categorize the test user into one of the predefined user classes and use the predicted class as the prediction for the test user. proposed algorithms within this category include a bayesian network approach [1], and the aspect model [***]. compared with the memory-based approaches, the model-based approaches only have to store the profiles of models and therefore are much more efficient than storing the whole user database - we exploit model smoothing to deal with the problem of sparse data and missing values. although some previous probabilistic models (e.g., the two-sided clustering approach [***]) have indirectly captured the difference between user ratings and user preferences, our approach is more direct: we explicitly convert ratings into preferences. the core part of our algorithm is a probabilistic mechanism that is able to transform the rating information of items into the likelihood for items to be preferred by a user influence:1 type:2 pair index:9 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user. unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users preferences on items separately from the users rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings.
we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:37 citee title:combining content and collaboration in text filtering citee abstract:we describe a technique for combining collaborative input and document content for text filtering. this technique uses latent semantic indexing to create a collaborative view of a collection of user profiles. the profiles themselves are term vectors constructed from documents deemed relevant to the user's information need using standard text collections. this approach performs quite favorably compared to other content-based approaches surrounding text:in this paper, we focus on the task of collaborative filtering, which is to predict the utility of items for a particular user based on the ratings of information items given by many other users. the fact that collaborative filtering does not rely on any content information about the items or descriptions of users, but only depends on the preference patterns of users makes it more general than other tasks such as ad hoc information retrieval and content-based filtering [***,7]. in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue influence:1 type:2 pair index:10 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user. unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users preferences on items separately from the users rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings. we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:38 citee title:content-boosted collaborative filtering for improved recommendations citee abstract:most recommender systems use collaborative filtering or content-based methods to predict new items of interest for a user. while both methods have their own advantages, individually they fail to provide good recommendations in many situations. incorporating components from both methods, a hybrid recommender system can overcome these shortcomings. in this paper, we present an elegant and effective framework for combining content and collaboration. our approach uses a content-based predictor to enhance existing user data, and then provides personalized suggestions through collaborative filtering.
we present experimental results that show how this approach, content-boosted collaborative filtering, performs better than a pure content-based predictor, pure collaborative filter, and a naive hybrid approach surrounding text:in this paper, we focus on the task of collaborative filtering, which is to predict the utility of items for a particular user based on the ratings of information items given by many other users. the fact that collaborative filtering does not rely on any content information about the items or descriptions of users, but only depends on the preference patterns of users makes it more general than other tasks such as ad hoc information retrieval and content-based filtering [6,***]. in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue - we refer to this measure as the mean absolute error (mae) in the rest of this paper. there are some other measures like the receiver operating characteristic (roc) as a decision-support accuracy measure [***] and the normalized mae. but since mae is the most commonly used metric and has been reported in most previous research [1, 2, ***, 8, 12], we chose it as the evaluation measure in our experiments to make our results more comparable - there are some other measures like the receiver operating characteristic (roc) as a decision-support accuracy measure [***] and the normalized mae. but since mae is the most commonly used metric and has been reported in most previous research [1, 2, ***, 8, 12], we chose it as the evaluation measure in our experiments to make our results more comparable influence:1 type:2 pair index:11 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user. unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users preferences on items separately from the users rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings. we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:39 citee title:swami: a framework for collaborative filtering algorithm development and evaluation citee abstract:we present a java-based framework, swami (shared wisdom through the amalgamation of many interpretations) for building and studying collaborative filtering systems. swami consists of three components: a prediction engine, an evaluation system, and a visualization component. the prediction engine provides a common interface for implementing different prediction algorithms.
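since mae is the evaluation measure named in the excerpt above, a tiny illustrative computation may help; the numbers are made up.

# mean absolute error between predicted and actual ratings (illustrative).
def mae(predicted, actual):
    assert predicted and len(predicted) == len(actual)
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

print(mae([3.5, 4.1, 2.0], [4, 4, 3]))  # -> about 0.533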
the evaluation system provides a standardized testing methodology and metrics for analyzing the accuracy and run-time performance of prediction algorithms. the visualization component suggests how graphical representations can inform the development and analysis of prediction algorithms. we demonstrate swami on the eachmovie data set by comparing three prediction algorithms: a traditional pearson correlation-based method, support vector machines, and a new accurate and scalable correlation-based method based on clustering techniques. surrounding text:in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue. many algorithms have been proposed to deal with the collaborative filtering problem [1, 2, 3, 4, 5, ***, 12]. according to [1], most collaborative filtering algorithms can be categorized into two classes: memory-based algorithms and model-based algorithms - to compare the methods across the full spectrum, we vary both the number of users in the training database and the number of observed items that are rated by the test user, as has been done in [1, ***, 12]. for the smaller database movierating, we allow the number of users in the training database to be 100 and 200, and the number of exposed items rated by the test users to be 5, 10 and 20 - there are some other measures like the receiver operating characteristic (roc) as a decision-support accuracy measure [7] and the normalized mae. but since mae is the most commonly used metric and has been reported in most previous research [1, 2, 7, ***, 12], we chose it as the evaluation measure in our experiments to make our results more comparable influence:1 type:2 pair index:12 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user. unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users preferences on items separately from the users rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings. we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:40 citee title:maximum likelihood from incomplete data via the em algorithm citee abstract:a broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived.
many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis. surrounding text:compared with the memory-based approaches, the model-based approaches only have to store the profiles of models and therefore are much more efficient than storing the whole user database. on the other hand, the offline computation of memorybased approaches can be much cheaper than the model-based approaches, which often require the use of the expectationmaximization (em) [***] algorithm, variation method and other sophisticated methods to combat the complexity of computation. furthermore, model-based approaches tend to assume that a small number of user classes is sufficient for modeling the rating patterns of users, thus imply a loss of diversity among users, which may be very important in helping predict the ratings of a test user influence:3 type:3,1 pair index:13 citer id:34 citer title:collaborative filtering with decoupled models for preferences and ratings citer abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user. unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users preferences on items separately from the users rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings. we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets citee id:31 citee title:collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach citee abstract:the growth of internet commerce has stimulated the use of collaborative filtering (cf) algorithms as recommender systems. such systems leverage knowledge about the known preferences of multiple users to recommend items of interest to other users. cf methods have been harnessed to make recommendations about such items as web pages, movies, books, and toys. researchers have proposed and evaluated many approaches for generating recommendations. we describe and evaluate a new method called personality diagnosis (pd). given a users preferences for some items, we compute the probability that he or she is of the same personality type as other users, and, in turn, the probability that he or she will like new items. pd retains some of the advantages of traditional similarity-weighting techniques in that all data is brought to bear on each prediction and new data can be added easily and incrementally. additionally, pd has a meaningful probabilistic interpretation, which may be leveraged to justify, explain, and augment results. 
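the excerpt above notes that model-based approaches typically rely on iterative offline procedures such as expectation-maximization. the following toy sketch fits a two-component one-dimensional gaussian mixture to a handful of ratings, purely to illustrate that kind of iteration; it is not the algorithm of any cited paper, and the initialization and iteration count are arbitrary assumptions.

import math

# illustrative em for a two-component 1-d gaussian mixture over ratings.

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iters=50):
    mu = [min(data), max(data)]          # initial means (assumption)
    var = [1.0, 1.0]                     # initial variances
    pi = [0.5, 0.5]                      # mixing weights
    for _ in range(iters):
        # e-step: responsibility of each component for each rating
        resp = []
        for x in data:
            p = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # m-step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(var[k], 1e-3)   # keep variance away from zero
            pi[k] = nk / len(data)
    return mu, var, pi

ratings = [1, 1, 2, 2, 2, 4, 5, 5, 4, 5]
print(em_two_gaussians(ratings))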
we report empirical results on the eachmovie database of movie ratings, and on user profile data collected from the citeseer digital library of computer science research papers. the probabilistic framework naturally supports a variety of descriptive measurements; in particular, we consider the applicability of a value of information (voi) computation. surrounding text:in many cases, collaborative filtering reflects a more realistic setup, particularly in the web environment, where descriptions of items and users are not available due to the privacy issue. many algorithms have been proposed to deal with the collaborative filtering problem [1, 2, 3, 4, 5, 8, ***]. according to [1], most collaborative filtering algorithms can be categorized into two classes: memory-based algorithms and model-based algorithms - have their own advantages and disadvantages, there are hybrid approaches that try to unify these two types of approaches into a single model. the personality diagnosis approach [***] belongs to this category, which appears to outperform previous model-based and memory-based approaches. there are two major issues involved in the collaborative filtering task - 2) second, the preference similarities between the test user and the training users are computed using the estimated preference information. as in the work by pennock et al. [***], the similarity between two users is computed based on the likelihood of mistaking one user for the other. 3) third, the preference patterns on the un-rated objects for the test user are computed as the combination of preference patterns of training users weighted by their preference similarities - 3.1 baseline methods: following [1,2,***], we compare our method to several existing methods, including the pearson correlation coefficient method, the vector similarity method and the personality diagnosis method. these methods are described below - to compare the methods across the full spectrum, we vary both the number of users in the training database and the number of observed items that are rated by the test user, as has been done in [1, 8, ***]. for the smaller database movierating, we allow the number of users in the training database to be 100 and 200, and the number of exposed items rated by the test users to be 5, 10 and 20 - there are some other measures like the receiver operating characteristic (roc) as a decision-support accuracy measure [7] and the normalized mae. but since mae is the most commonly used metric and has been reported in most previous research [1, 2, 7, 8, ***], we chose it as the evaluation measure in our experiments to make our results more comparable influence:1 type:2 pair index:14 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items.
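paraphrasing the personality-diagnosis idea described above (the similarity of two users is the likelihood of mistaking one for the other under rating noise), a rough sketch might look as follows. the gaussian noise model with a fixed sigma, the dictionary layout, and the fact that it returns a point estimate instead of a full rating distribution are all simplifying assumptions, not details of the cited method.

import math

# rough sketch of a personality-diagnosis style prediction (illustrative).
# each training user is treated as a "personality"; the weight of that
# personality is the likelihood of the active user's observed ratings under
# gaussian noise around the training user's ratings.

SIGMA = 1.0  # assumed standard deviation of the rating noise

def personality_likelihood(active, train):
    common = set(active) & set(train)
    if not common:
        return 0.0
    ll = 0.0
    for item in common:
        diff = active[item] - train[item]
        ll += -(diff ** 2) / (2 * SIGMA ** 2)
    return math.exp(ll)

def predict(active, train_users, target_item):
    num, den = 0.0, 0.0
    for train in train_users:
        if target_item not in train:
            continue
        w = personality_likelihood(active, train)
        num += w * train[target_item]
        den += w
    return num / den if den > 0 else None

train_users = [
    {"m1": 5, "m2": 4, "m3": 2},
    {"m1": 1, "m2": 2, "m3": 5},
]
active = {"m1": 5, "m2": 5}
print(predict(active, train_users, "m3"))  # closer to 2 than to 5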
second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other state-of-the-art collaborative filtering algorithms and it is more robust against data sparsity citee id:54 citee title:unifying collaborative and content-based filtering citee abstract:collaborative and content-based filtering are two paradigms that have been applied in the context of recommender systems and user preference prediction. this paper proposes a novel, unified approach that systematically integrates all available training information such as past user-item ratings as well as attributes of items or users to learn a prediction function. the key ingredient of our method is the design of a suitable kernel or similarity function between user-item pairs that allows simultaneous generalization across the user and item dimensions. we propose an on-line algorithm (jrank) that generalizes perceptron learning. experimental results on the eachmovie data set show significant improvements over standard approaches. surrounding text:2.3 other related approaches: in order to combine the advantages of memory-based and model-based approaches, hybrid collaborative filtering methods have been studied recently [14, 22]. [***, 4] unified collaborative filtering and content-based filtering, which achieved significant improvements over the standard approaches. at the same time, in order to solve the data sparsity problem, researchers proposed dimensionality reduction approaches in [15] - r_{a,i} is the rating user a gave to item i, and \bar{r}_a represents the average rating of user a. from this definition, user similarity sim(a, u) ranges over [0, ***], and a larger value means users a and u are more similar. item-based methods such as [5, 17] are similar to user-based approaches, and the difference is that item-based methods employ the similarity between the items instead of users - r_{u,i} is the rating user u gave to item i, and \bar{r}_i represents the average rating of item i. like user similarity, item similarity sim(i, j) also ranges over [0, ***] - we use the following equation to solve this problem: sim'(a, u) = ( \min(|I_a \cap I_u|, \gamma) / \gamma ) \cdot sim(a, u) (4), where |I_a \cap I_u| is the number of items which user a and user u rated in common. this change bounds the similarity sim'(a, u) to the interval [0, ***]. then the similarity between items could be defined as: sim'(i, j) = ( \min(|U_i \cap U_j|, \delta) / \delta ) \cdot sim(i, j) (5), where |U_i \cap U_j| is the number of users who rated both item i and item j - p(r_{u,i}) = \lambda \cdot ( \bar{r}_u + \sum_{u_a \in S(u)} sim(u_a, u) (r_{u_a,i} - \bar{r}_{u_a}) / \sum_{u_a \in S(u)} sim(u_a, u) ) + (1 - \lambda) \cdot ( \bar{r}_i + \sum_{i_k \in S(i)} sim(i_k, i) (r_{u,i_k} - \bar{r}_{i_k}) / \sum_{i_k \in S(i)} sim(i_k, i) ) (8), where \lambda is the parameter in the range of [0, ***]. the use of the parameter \lambda allows us to determine how the prediction relies on user-based prediction and item-based prediction - 8. references [***] j. basilico and t influence:1 type:2 pair index:15 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems.
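a minimal sketch of the significance-weighted similarities and the lambda-weighted combination of user-based and item-based predictions, as reconstructed in equations (4), (5) and (8) above. the data layout, the helper names, the parameter values (gamma, delta, lam) and the use of absolute similarities in the denominators are assumptions made to keep the toy example stable; they are not taken from the cited papers.

def pcc(a, b):
    # pearson correlation over the co-rated keys of two rating dictionaries
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[k] for k in common) / len(common)
    mb = sum(b[k] for k in common) / len(common)
    num = sum((a[k] - ma) * (b[k] - mb) for k in common)
    den = (sum((a[k] - ma) ** 2 for k in common) *
           sum((b[k] - mb) ** 2 for k in common)) ** 0.5
    return num / den if den else 0.0

def sig_weight(n_common, threshold):
    # equations (4)/(5): devalue similarities built on few co-ratings
    return min(n_common, threshold) / threshold

def predict(user_ratings, item_ratings, u, i, gamma=30, delta=25, lam=0.5):
    # user-based term of equation (8)
    r_u = sum(user_ratings[u].values()) / len(user_ratings[u])
    num_u = den_u = 0.0
    for v, rv in user_ratings.items():
        if v == u or i not in rv:
            continue
        s = sig_weight(len(set(rv) & set(user_ratings[u])), gamma) * pcc(rv, user_ratings[u])
        num_u += s * (rv[i] - sum(rv.values()) / len(rv))
        den_u += abs(s)  # absolute value keeps the toy example stable
    user_term = r_u + (num_u / den_u if den_u else 0.0)
    # item-based term of equation (8)
    r_i = sum(item_ratings[i].values()) / len(item_ratings[i])
    num_i = den_i = 0.0
    for j, rj in item_ratings.items():
        if j == i or u not in rj:
            continue
        s = sig_weight(len(set(rj) & set(item_ratings[i])), delta) * pcc(rj, item_ratings[i])
        num_i += s * (rj[u] - sum(rj.values()) / len(rj))
        den_i += abs(s)
    item_term = r_i + (num_i / den_i if den_i else 0.0)
    # lambda balances the two predictions
    return lam * user_term + (1 - lam) * item_term

user_ratings = {
    "u1": {"m1": 4, "m2": 5, "m3": 1},
    "u2": {"m1": 5, "m2": 4, "m3": 2, "m4": 1},
    "u3": {"m2": 1, "m3": 5, "m4": 4},
}
item_ratings = {}
for u, rs in user_ratings.items():
    for i, r in rs.items():
        item_ratings.setdefault(i, {})[u] = r
print(predict(user_ratings, item_ratings, "u1", "m4"))  # predicted rating of u1 for m4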
usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:35 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation metrics. the first characterizes accuracy over a set of individual predictions in terms of average absolute deviation. the second estimates the utility of a ranked list of suggested items. this metric uses an estimate of the probability that a user will see a recommendation in an ordered list. experiments were run for datasets associated with 3 application areas, 4 experimental protocols, and the 2 evaluation metrics for the various algorithms. results indicate that for a wide range of conditions, bayesian networks with decision trees at each node and correlation methods outperform bayesian-clustering and vector-similarity methods. between correlation and bayesian networks, the preferred method depends on the nature of the dataset, nature of the application (ranked versus one-by-one presentation), and the availability of votes with which to make predictions. other considerations include the size of database, speed of predictions, and learning time. surrounding text:the research of collaborative filtering started from memory-based approaches which utilize the entire user-item database to generate a prediction based on user or item similarity. two types of memory-based methods have been studied: user-based [***, 7, 10, 22] and item-based [5, 12, 17]. user-based methods first look for some similar users who have similar rating styles with the active user and then employ the ratings from those similar users to predict the ratings for the active user - whether in user-based approaches or in item-based approaches, the computation of similarity between users or items is a very critical step. notable similarity computation algorithms include pearson correlation coefficient (pcc) [16] and vector space similarity (vss) algorithm [***]. 
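to illustrate the difference between the two similarity measures named above, here is a small self-contained comparison of pcc and vector-space (cosine) similarity; the rating dictionaries are invented.

def pcc(a, b):
    common = set(a) & set(b)
    if len(common) < 2:
        return 0.0
    ma = sum(a[i] for i in common) / len(common)
    mb = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - ma) * (b[i] - mb) for i in common)
    den = (sum((a[i] - ma) ** 2 for i in common) *
           sum((b[i] - mb) ** 2 for i in common)) ** 0.5
    return num / den if den else 0.0

def vss(a, b):
    common = set(a) & set(b)
    num = sum(a[i] * b[i] for i in common)
    den = (sum(v * v for v in a.values()) ** 0.5 *
           sum(v * v for v in b.values()) ** 0.5)
    return num / den if den else 0.0

tough = {"m1": 2, "m2": 3, "m3": 1}   # a "tough" rater
easy = {"m1": 4, "m2": 5, "m3": 3}    # same taste, more generous rating style
print(pcc(tough, easy), vss(tough, easy))
# pcc returns exactly 1.0 because it subtracts each user's mean rating,
# while the raw cosine similarity stays below 1 despite identical preferences.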
although memory-based approaches have been widely used in recommendation systems [12, 16], the problem of inaccurate recommendation results still exists in both user-based and item-based approaches - 2.1 memory-based approaches: the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [12, 16]. the most analyzed examples of memory-based collaborative filtering include user-based approaches [***, 7, 10, 22] and item-based approaches [5, 12, 17]. user-based approaches predict the ratings of active users based on the ratings of similar users found, and item-based approaches predict the ratings of active users based on the information of similar items computed - user-based and item-based approaches often use pcc algorithm [16] and vss algorithm [***] as the similarity computation methods. pcc-based collaborative filtering generally can achieve higher performance than the other popular algorithm vss, since it considers the differences of user rating styles - 3. similarity computation: this section briefly introduces the similarity computation methods in traditional user-based and item-based collaborative filtering [***, 5, 7, 17] as well as the method proposed in this paper. given a recommendation system consisting of m users and n items, the relationship between users and items is denoted by an m × n matrix, called the user-item matrix - 3.2 significance weighting: pcc-based collaborative filtering generally can achieve higher performance than other popular algorithms like vss [***], since it considers the factor of the differences of user rating styles. however pcc will overestimate the similarities of users who happen to have rated a few items identically, but may not have similar overall preferences [13] - table 4 summarizes our experimental results. we compare with the following algorithms: similarity fusion (sf) [21], smoothing and cluster-based pcc (scbpcc) [22], the aspect model (am) [9], personality diagnosis (pd) [14] and the user-based pcc [***]. our method outperforms all other competitive algorithms in various configurations influence:1 type:2 pair index:16 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not.
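the m × n user-item matrix described above is usually stored sparsely; a minimal illustration (with made-up data) of going from a dictionary-of-dictionaries representation to a dense matrix with missing entries:

# illustrative sparse representation of the user-item matrix: absent entries
# mean "not rated"; None marks missing ratings in the dense form.
ratings = {
    "u1": {"m1": 4, "m3": 2},
    "u2": {"m2": 5},
}
users = sorted(ratings)
items = sorted({i for r in ratings.values() for i in r})
matrix = [[ratings[u].get(i) for i in items] for u in users]
print(items)
print(matrix)  # [[4, None, 2], [None, 5, None]]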
we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:41 citee title:collaborative filtering with privacy via factor analysis citee abstract:collaborative filtering (cf) is valuable in e-commerce, and for direct recommendations for music, movies, news etc. but today's systems have several disadvantages, including privacy risks. as we move toward ubiquitous computing, there is a great potential for individuals to share all kinds of information about places and things to do, see and buy, but the privacy risks are severe. in this paper we describe a new method for collaborative filtering which protects the privacy of individual data. the method is based on a probabilistic factor analysis model. privacy protection is provided by a peer-to-peer protocol which is described elsewhere, but outlined in this paper. the factor analysis approach handles missing data without requiring default values for them. we give several experiments that suggest that this is most accurate method for cf to date. the new algorithm has other advantages in speed and storage over previous algorithms. finally, we suggest applications of the approach to other kinds of statistical analyses of survey or questionaire data. surrounding text:2 model-based approaches in the model-based approaches, training datasets are used to train a predefined model. examples of model-based approaches include clustering models [11, 20, 22], aspect models [8, 9, 19] and latent factor model [***]. [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available influence:1 type:2 pair index:17 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. 
finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:44 citee title:combining content-based and collaborative filters in an online newspaper citee abstract:the explosive growth of mailing lists, web sites and usenet news demands effective filtering solutions. collaborative filtering combines the informed opinions of humans to make personalized, accurate predictions. content-based filtering uses the speed of computers to make complete, fast predictions. in this work, we present a new filtering approach that combines the coverage and speed of content-filters with the depth of collaborative filtering. we apply our research approach to an online newspaper, an as yet untapped opportunity for filters useful to the wide-spread news reading populace. we present the design of our filtering system and describe the results from preliminary experiments that suggest merits to our approach. surrounding text:3 other related approaches in order to take the advantages of memory-based and model-based approaches, hybrid collaborative filtering methods have been studied recently [14, 22]. [1, ***] unified collaborative filtering and content-based filtering, which achieved significant improvements over the standard approaches. at the same time, in order to solve the data sparsity problem, researchers proposed dimensionality reduction approaches in [15] influence:1 type:2 pair index:18 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:55 citee title:item-based top-n recommendation citee abstract:the explosive growth of the world-wide-web and the emergence of e-commerce has led to the development of recommender systems---a personalized information filtering technology used to identify a set of items that will be of interest to a certain user. user-based collaborative filtering is the most successful technology for building recommender systems to date and is extensively used in many commercial recommender systems. 
unfortunately, the computational complexity of these methods grows linearly with the number of customers, which in typical commercial applications can be several millions. to address these scalability concerns, model-based recommendation techniques have been developed. these techniques analyze the user-item matrix to discover relations between the different items and use these relations to compute the list of recommendations. in this article, we present one such class of model-based recommendation algorithms that first determines the similarities between the various items and then uses them to identify the set of items to be recommended. the key steps in this class of algorithms are (i) the method used to compute the similarity between the items, and (ii) the method used to combine these similarities in order to compute the similarity between a basket of items and a candidate recommender item. our experimental evaluation on eight real datasets shows that these item-based algorithms are up to two orders of magnitude faster than the traditional user-neighborhood based recommender systems and provide recommendations with comparable or better quality surrounding text:the research of collaborative filtering started from memory-based approaches which utilize the entire user-item database to generate a prediction based on user or item similarity. two types of memory-based methods have been studied: user-based [2, 7, 10, 22] and item-based [***, 12, 17]. user-based methods first look for some similar users who have similar rating styles with the active user and then employ the ratings from those similar users to predict the ratings for the active user - 2.1 memory-based approaches: the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [12, 16]. the most analyzed examples of memory-based collaborative filtering include user-based approaches [2, 7, 10, 22] and item-based approaches [***, 12, 17] - 3. similarity computation: this section briefly introduces the similarity computation methods in traditional user-based and item-based collaborative filtering [2, ***, 7, 17] as well as the method proposed in this paper. given a recommendation system consisting of m users and n items, the relationship between users and items is denoted by an m × n matrix, called the user-item matrix - from this definition, user similarity sim(a, u) ranges over [0, 1], and a larger value means users a and u are more similar. item-based methods such as [***, 17] are similar to user-based approaches, and the difference is that item-based methods employ the similarity between the items instead of users. the basic idea in similarity computation between two items i and j is to first isolate the users who have rated both of these items and then apply a similarity computation technique to determine the similarity sim(i, j) [17] influence:1 type:2 pair index:19 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems.
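the item-based similarity step described above (isolate the users who rated both items, then correlate the two rating vectors) can be sketched as follows; the data and the function name are illustrative assumptions.

# sketch of item-based similarity over co-rating users (illustrative only).
def item_sim(item_i, item_j):
    """item_i, item_j: dicts mapping user id -> that user's rating of the item."""
    raters = set(item_i) & set(item_j)
    if len(raters) < 2:
        return 0.0  # not enough co-raters to correlate
    mi = sum(item_i[u] for u in raters) / len(raters)
    mj = sum(item_j[u] for u in raters) / len(raters)
    num = sum((item_i[u] - mi) * (item_j[u] - mj) for u in raters)
    den = (sum((item_i[u] - mi) ** 2 for u in raters) *
           sum((item_j[u] - mj) ** 2 for u in raters)) ** 0.5
    return num / den if den else 0.0

m1 = {"u1": 5, "u2": 4, "u3": 1}
m2 = {"u1": 4, "u2": 5, "u4": 2}
print(item_sim(m1, m2))  # correlation computed over the co-raters u1 and u2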
usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other state-of-the-art collaborative filtering algorithms and it is more robust against data sparsity citee id:9 citee title:an empirical analysis of design choices in neighborhood-based collaborative filtering algorithms citee abstract:collaborative filtering systems predict a user's interest in new items based on the recommendations of other people with similar interests. instead of performing content indexing or content analysis, collaborative filtering systems rely entirely on interest ratings from members of a participating community. since predictions are based on human ratings, collaborative filtering systems have the potential to provide filtering based on complex attributes, such as quality, taste, or aesthetics. many implementations of collaborative filtering apply some variation of the neighborhood-based prediction algorithm. many variations of similarity metrics, weighting approaches, combination measures, and rating normalization have appeared in each implementation. for these parameters and others, there is no consensus as to which choice of technique is most appropriate for what situations, nor how significant an effect on accuracy each parameter has. consequently, every person implementing a collaborative filtering system must make hard design choices with little guidance. this article provides a set of recommendations to guide design of neighborhood-based prediction systems, based on the results of an empirical study. we apply an analysis framework that divides the neighborhood-based prediction approach into three components and then examines variants of the key parameters in each component. the three components identified are similarity computation, neighbor selection, and rating combination. surrounding text:herlocker et al. [***, 7] proposed to add a correlation significance weighting factor that would devalue similarity weights that were based on a small number of co-rated items. herlocker's latest research work [13] proposed to use the following modified similarity computation equation: sim'(a, u) = ( \max(|I_a \cap I_u|, \gamma) / \gamma ) \cdot sim(a, u) influence:1 type:2 pair index:20 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems.
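a tiny numeric illustration of how the max-based weighting attributed to herlocker above differs from the min-based variant of equation (4); the threshold value gamma = 30 is an assumption for the example only.

gamma = 30
for n_common in (5, 30, 90):
    max_style = max(n_common, gamma) / gamma  # as reconstructed above
    min_style = min(n_common, gamma) / gamma  # equation (4) variant
    print(n_common, round(max_style, 2), round(min_style, 2))
# with many co-rated items the max-based factor grows past 1,
# while the min-based factor is capped at 1.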
usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:32 citee title:collaborative filtering via gaussian probabilistic latent semantic analysis citee abstract:collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, i.e. a database of available user preferences. in this paper, we describe a new model-based algorithm designed for this task, which is based on a generalization of probabilistic latent semantic analysis to continuous-valued response variables. more specifically, we assume that the observed user ratings can be modeled as a mixture of user communities or interest groups, where users may participate probabilistically in one or more groups. each community is characterized by a gaussian distribution on the normalized ratings for each item. the normalization of ratings is performed in a user-specific manner to account for variations in absolute shift and variance of ratings. experiments on the eachmovie data set show that the proposed approach compares favorably with other collaborative filtering techniques surrounding text:2 model-based approaches in the model-based approaches, training datasets are used to train a predefined model. examples of model-based approaches include clustering models [11, 20, 22], aspect models [***, 9, 19] and latent factor model [3]. [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available - [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available. authors in [***] proposed an algorithm based on a generalization of probabilistic latent semantic analysis to continuousvalued response variables. the model-based approaches are often time-consuming to build and update, and cannot cover as diverse a user range as the memory-based approaches do [22] influence:1 type:2 pair index:21 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. 
this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:56 citee title:latent semantic models for collaborative filtering citee abstract:collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, that is, a database of available user preferences. in this article, we describe a new family of model-based algorithms designed for this task. these algorithms rely on a statistical modelling technique that introduces latent class variables in a mixture model setting to discover user communities and prototypical interest profiles. we investigate several variations to deal with discrete and continuous response variables as well as with different objective functions. the main advantages of this technique over standard memory-based methods are higher accuracy, constant time prediction, and an explicit and compact model representation. the latter can also be used to mine for user communitites. the experimental evaluation shows that substantial improvements in accucracy over existing methods and published results can be obtained. surrounding text:2 model-based approaches in the model-based approaches, training datasets are used to train a predefined model. examples of model-based approaches include clustering models [11, 20, 22], aspect models [8, ***, 19] and latent factor model [3]. [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available - table 4 summarizes our experimental results. we compare with the following algorithms: similarity fusion (sf) [21], smoothing and cluster-based pcc (scbpcc) [22], the aspect model (am) [***], personality diagnosis (pd) [14] and the user-based pcc [2]. our method outperforms all other competitive algorithms in various configurations influence:1 type:2 pair index:22 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. 
first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:7 citee title:an automatic weighting scheme for collaborative filtering citee abstract:collaborative filtering identifies information interest of a particular user based on the information provided by other similar users. the memory-based approaches for collaborative filtering (e.g., pearson correlation coefficient approach) identify the similarity between two users by comparing their ratings on a set of items. in these approaches, different items are weighted either equally or by some predefined functions. the impact of rating discrepancies among different users has not been taken into consideration. for example, an item that is highly favored by most users should have a smaller impact on the user-similarity than an item for which different types of users tend to give different ratings. even though simple weighting methods such as variance weighting try to address this problem, empirical studies have shown that they are ineffective in improving the performance of collaborative filtering. in this paper, we present an optimization algorithm to automatically compute the weights for different items based on their ratings from training users. more specifically, the new weighting scheme will create a clustered distribution for user vectors in the item space by bringing users of similar interests closer and separating users of different interests more distant. empirical studies over two datasets have shown that our new weighting scheme substantially improves the performance of the pearson correlation coefficient method for collaborative filtering. surrounding text:the research of collaborative filtering started from memory-based approaches which utilize the entire user-item database to generate a prediction based on user or item similarity. two types of memory-based methods have been studied: user-based [2, 7, ***, 22] and item-based [5, 12, 17]. user-based methods first look for some similar users who have similar rating styles with the active user and then employ the ratings from those similar users to predict the ratings for the active user - 1 memory-based approaches the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [12, 16]. the most analyzed examples of memory-based collaborative filtering include userbased approaches [2, 7, ***, 22] and item-based approaches [5, 12, 17]. 
user-based approaches predict the ratings of active users based on the ratings of similar users found, and item-based approaches predict the ratings of active users based on the information of similar items computed influence:1 type:2 pair index:23 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems.
usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:6 citee title:amazon.com recommendations: item-to-item collaborative filtering citee abstract:recommendation algorithms are best known for their use on e-commerce web sites,1 where they use input about a customers interests to generate a list of recommended items. many applications use only the items that customers purchase and explicitly rate to represent their interests, but they can also use other attributes, including items viewed, demographic data, subject interests, and favorite artists. at amazon.com, we use recommendation algorithms to personalize the online store for each customer. the store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. the click-through and conversion rates two important measures of web-based and email advertising effectiveness vastly exceed those of untargeted content such as banner advertisements and top-seller lists. surrounding text:the research of collaborative filtering started from memory-based approaches which utilize the entire user-item database to generate a prediction based on user or item similarity. two types of memory-based methods have been studied: user-based [2, 7, 10, 22] and item-based [5, ***, 17]. user-based methods first look for some similar users who have similar rating styles with the active user and then employ the ratings from those similar users to predict the ratings for the active user - notable similarity computation algorithms include pearson correlation coefficient (pcc) [16] and vector space similarity (vss) algorithm [2]. although memory-based approaches have been widely used in recommendation systems [***, 16], the problem of inaccurate recommendation results still exists in both user-based and item-based approaches. the fundamental problem of memory-based approaches is the data sparsity of the useritem matrix - 2. 1 memory-based approaches the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [***, 16]. the most analyzed examples of memory-based collaborative filtering include userbased approaches [2, 7, 10, 22] and item-based approaches [5, ***, 17] - 1 memory-based approaches the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [***, 16]. 
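The surrounding text above repeatedly contrasts the Pearson correlation coefficient (PCC) and vector space similarity (VSS) measures for user-based collaborative filtering. A minimal sketch of the two user-similarity computations, assuming a dense ratings array with 0 marking missing entries (function names are illustrative):

```python
import numpy as np

def pcc_user_similarity(R, a, u):
    """Pearson correlation coefficient between users a and u on co-rated items.

    Ratings are centered on each user's mean over the co-rated items, which is
    what lets PCC account for different user rating styles.
    """
    co = (R[a] > 0) & (R[u] > 0)          # items rated by both users
    if co.sum() == 0:
        return 0.0
    ra, ru = R[a, co], R[u, co]
    da, du = ra - ra.mean(), ru - ru.mean()
    den = np.linalg.norm(da) * np.linalg.norm(du)
    return float(np.dot(da, du) / den) if den else 0.0

def vss_user_similarity(R, a, u):
    """Vector space (cosine) similarity: the same computation without mean-centering."""
    co = (R[a] > 0) & (R[u] > 0)
    if co.sum() == 0:
        return 0.0
    ra, ru = R[a, co], R[u, co]
    den = np.linalg.norm(ra) * np.linalg.norm(ru)
    return float(np.dot(ra, ru) / den) if den else 0.0

R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 0, 5]], dtype=float)   # 0 = missing rating
print(pcc_user_similarity(R, 0, 1), vss_user_similarity(R, 0, 1))
```

Mean-centering over the co-rated items is what lets PCC absorb differences in user rating styles, which the surrounding text gives as the reason PCC generally outperforms VSS.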
the most analyzed examples of memory-based collaborative filtering include userbased approaches [2, 7, 10, 22] and item-based approaches [5, ***, 17]. user-based approaches predict the ratings of active users based on the ratings of similar users found, and itembased approaches predict the ratings of active users based on the information of similar items computed influence:1 type:2 pair index:25 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:0 citee title:a collaborative filtering algorithm and evaluation metric that accurately model the user experience citee abstract:collaborative filtering (cf) systems have been researched for over a decade as a tool to deal with information overload. at the heart of these systems are the algorithms which generate the predictions and recommendations.in this article we empirically demonstrate that two of the most acclaimed cf recommendation algorithms have flaws that result in a dramatically unacceptable user experience.in response, we introduce a new belief distribution algorithm that overcomes these flaws and provides substantially richer user modeling. the belief distribution algorithm retains the qualities of nearest-neighbor algorithms which have performed well in the past, yet produces predictions of belief distributions across rating values rather than a point rating value.in addition, we illustrate how the exclusive use of the mean absolute error metric has concealed these flaws for so long, and we propose the use of a modified precision metric for more accurately evaluating the user experience. surrounding text:2 significanceweighting pcc-based collaborative filtering generally can achieve higher performance than other popular algorithms like vss [2], since it considers the factor of the differences of user rating styles. however pcc will overestimate the similarities of users who happen to have rated a few items identically, but may not have similar overall preferences [***]. 
herlocker et al. [6, 7] proposed to add a correlation significance weighting factor that would devalue similarity weights that were based on a small number of co-rated items. herlocker's latest research work [***] proposed to use the following modified similarity computation equation: sim'(a, u) = (max(|i_a ∩ i_u|, γ) / γ) · sim(a, u). (3) this equation overcomes the problem when only a few items are rated in common, but when |i_a ∩ i_u| is much higher than γ, the weighted similarity sim'(a, u) will be larger than 1, and can even surpass 2 or 3 in worse cases influence:1 type:2 pair index:26 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether to predict the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on the movielens dataset have shown that our newly proposed method outperforms other state-of-the-art collaborative filtering algorithms and it is more robust against data sparsity citee id:31 citee title:collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach citee abstract:the growth of internet commerce has stimulated the use of collaborative filtering (cf) algorithms as recommender systems. such systems leverage knowledge about the known preferences of multiple users to recommend items of interest to other users. cf methods have been harnessed to make recommendations about such items as web pages, movies, books, and toys. researchers have proposed and evaluated many approaches for generating recommendations. we describe and evaluate a new method called personality diagnosis (pd). given a user's preferences for some items, we compute the probability that he or she is of the same personality type as other users, and, in turn, the probability that he or she will like new items. pd retains some of the advantages of traditional similarity-weighting techniques in that all data is brought to bear on each prediction and new data can be added easily and incrementally. additionally, pd has a meaningful probabilistic interpretation, which may be leveraged to justify, explain, and augment results. we report empirical results on the eachmovie database of movie ratings, and on user profile data collected from the citeseer digital library of computer science research papers.
the probabilistic framework naturally supports a variety of descriptive measurementsin particular, we consider the applicability of a value of information (voi) computation. surrounding text:2. 3 other related approaches in order to take the advantages of memory-based and model-based approaches, hybrid collaborative filtering methods have been studied recently [***, 22]. [1, 4] unified collaborative filtering and content-based filtering, which achieved significant improvements over the standard approaches - table 4 summarizes our experimental results. we compare with the following algorithms: similarity fusion (sf) [21], smoothing and cluster-based pcc (scbpcc) [22], the aspect model (am) [9], personality diagnosis (pd) [***] and the user-based pcc [2]. our method outperforms all other competitive algorithms in various configurations influence:1 type:2 pair index:27 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:57 citee title:fast maximum margin matrix factorization for collaborative prediction citee abstract:maximum margin matrix factorization (mmmf) was recently suggested (srebro et al., 2005) as a convex, infinite dimensional alternative to low-rank approximations and standard factor models. mmmf can be formulated as a semi-definite programming (sdp) and learned using standard sdp solvers. however, current sdp solvers can only handle mmmf problems on matrices of dimensionality up to a few hundred. here, we investigate a direct gradient-based optimization method for mmmf and demonstrate it on large collaborative prediction problems. we compare against results obtained by marlin (2004) and find that mmmf substantially outperforms all nine methods he tested surrounding text:[1, 4] unified collaborative filtering and content-based filtering, which achieved significant improvements over the standard approaches. at the same time, in order to solve the data sparsity problem, researchers proposed dimensionality reduction approaches in [***]. 
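Equation (3) quoted in the significance-weighting discussion above multiplies the PCC similarity by max(|I_a ∩ I_u|, γ)/γ, which is exactly what lets the weighted similarity exceed 1 when many items are co-rated. Below is a small sketch of that factor and of one capped variant; the capped form and the default γ are assumptions for illustration, since the excerpt only states that the citing paper adds one parameter to fix the problem:

```python
def significance_weighted_sim(sim, n_common, gamma=30):
    """Significance weighting as in equation (3) above:
    sim'(a, u) = max(|Ia ∩ Iu|, gamma) / gamma * sim(a, u).

    As the surrounding text notes, this factor exceeds 1 whenever the number
    of co-rated items is larger than gamma, so the weighted similarity can
    leave the [0, 1] range.
    """
    return max(n_common, gamma) / gamma * sim

def capped_significance_weighted_sim(sim, n_common, gamma=30):
    """One way to keep the weight within [0, 1]: replace max with min.
    (An assumption for illustration; the excerpt does not show the cited
    paper's exact correction, only that it adds one parameter.)
    """
    return min(n_common, gamma) / gamma * sim

print(significance_weighted_sim(0.8, 90))        # 2.4 -- overshoots 1
print(capped_significance_weighted_sim(0.8, 90)) # 0.8 -- unchanged when many co-ratings
print(capped_significance_weighted_sim(0.8, 6))  # 0.16 -- devalued when few co-ratings
```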
the dimensionality-reduction approach addressed the sparsity problem by deleting unrelated or insignificant users or items, which would discard some information of the user-item matrix influence:2 type:2 pair index:28 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:33 citee title:grouplens: an open architecture for collaborative filtering of netnews citee abstract:collaborative filters help people make choices based on the opinions of other people. grouplens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. news reader clients display predicted scores and make it easy for users to rate articles after they read them. rating servers, called better bit bureaus, gather and disseminate the ratings. the rating servers predict scores based on the heuristic that people who surrounding text:whether in user-based approaches or in item-based approaches, the computation of similarity between users or items is a very critical step. notable similarity computation algorithms include pearson correlation coefficient (pcc) [***] and vector space similarity (vss) algorithm [2]. although memory-based approaches have been widely used in recommendation systems [12, ***], the problem of inaccurate recommendation results still exists in both user-based and item-based approaches - notable similarity computation algorithms include pearson correlation coefficient (pcc) [***] and vector space similarity (vss) algorithm [2]. although memory-based approaches have been widely used in recommendation systems [12, ***], the problem of inaccurate recommendation results still exists in both user-based and item-based approaches. the fundamental problem of memory-based approaches is the data sparsity of the useritem matrix - 2. 1 memory-based approaches the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [12, ***]. 
the most analyzed examples of memory-based collaborative filtering include userbased approaches [2, 7, 10, 22] and item-based approaches [5, 12, 17] - user-based approaches predict the ratings of active users based on the ratings of similar users found, and itembased approaches predict the ratings of active users based on the information of similar items computed. user-based and item-based approaches often use pcc algorithm [***] and vss algorithm [2] as the similarity computation methods. pcc-based collaborative filtering generally can achieve higher performance than the other popular algorithm vss, since it considers the differences of user rating styles influence:1 type:2 pair index:29 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:58 citee title:item-based collaborative filtering recommendation algorithms citee abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between di erent items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze di erent item-based recommendation generation algorithms. 
we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms surrounding text:the research of collaborative filtering started from memory-based approaches which utilize the entire user-item database to generate a prediction based on user or item similarity. two types of memory-based methods have been studied: user-based [2, 7, 10, 22] and item-based [5, 12, ***]. user-based methods first look for some similar users who have similar rating styles with the active user and then employ the ratings from those similar users to predict the ratings for the active user - 1 memory-based approaches the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [12, 16]. the most analyzed examples of memory-based collaborative filtering include user-based approaches [2, 7, 10, 22] and item-based approaches [5, 12, ***]. user-based approaches predict the ratings of active users based on the ratings of similar users found, and item-based approaches predict the ratings of active users based on the information of similar items computed - 3. similarity computation this section briefly introduces the similarity computation methods in traditional user-based and item-based collaborative filtering [2, 5, 7, ***] as well as the method proposed in this paper. given a recommendation system consisting of m users and n items, the relationship between users and items is denoted by an m × n matrix, called the user-item matrix - from this definition, user similarity sim(a, u) ranges over [0, 1], and a larger value means users a and u are more similar. item-based methods such as [5, ***] are similar to user-based approaches, and the difference is that item-based methods employ the similarity between the items instead of users. the basic idea in similarity computation between two items i and j is to first isolate the users who have rated both of these items and then apply a similarity computation technique to determine the similarity sim(i, j) [***] - item-based methods such as [5, ***] are similar to user-based approaches, and the difference is that item-based methods employ the similarity between the items instead of users. the basic idea in similarity computation between two items i and j is to first isolate the users who have rated both of these items and then apply a similarity computation technique to determine the similarity sim(i, j) [***]. the pcc-based similarity computation between two items i and j can be described as: sim(i, j) = Σ_{u ∈ U(i) ∩ U(j)} (r_{u,i} − r̄_i)(r_{u,j} − r̄_j) / (sqrt(Σ_{u ∈ U(i) ∩ U(j)} (r_{u,i} − r̄_i)²) · sqrt(Σ_{u ∈ U(i) ∩ U(j)} (r_{u,j} − r̄_j)²)), where r̄_i denotes the average rating of item i - 4. collaborative filtering framework in practice, the user-item matrix of a commercial recommendation system is very sparse and the density of available ratings is often less than 1% [***].
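A minimal sketch of the item-based PCC similarity written out above: restrict to users who rated both items and center each item's ratings on its mean. The array layout, the choice of taking item means over all of an item's raters, and the guard for fewer than two co-raters are assumptions:

```python
import numpy as np

def item_pcc_similarity(R, i, j):
    """Item-based PCC similarity matching the equation quoted above:
    co-rated users only, ratings centered on each item's mean rating.
    R is a users x items array with 0 marking missing ratings.
    """
    co = (R[:, i] > 0) & (R[:, j] > 0)     # users who rated both items
    if co.sum() < 2:
        return 0.0
    ri, rj = R[co, i], R[co, j]
    # item means taken over all users who rated the item (one common convention)
    mi = R[R[:, i] > 0, i].mean()
    mj = R[R[:, j] > 0, j].mean()
    num = np.sum((ri - mi) * (rj - mj))
    den = np.sqrt(np.sum((ri - mi) ** 2)) * np.sqrt(np.sum((rj - mj) ** 2))
    return float(num / den) if den else 0.0

R = np.array([[5, 3, 4, 0],
              [4, 2, 5, 1],
              [1, 0, 2, 5],
              [4, 3, 4, 0]], dtype=float)
print(item_pcc_similarity(R, 0, 2))
```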
sparse matrix directly leads to the prediction inaccuracy in traditional user-based or item-based collaborative filtering influence:1 type:2 pair index:30 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:59 citee title:flexible mixture model for collaborative filtering citee abstract:this paper presents a flexible mixture model (fmm) for collaborative filtering. fmm extends existing partitioning/clustering algorithms for collaborative filtering by clustering both users and items together simultaneously without assuming that each user and item should only belong to a single cluster. furthermore, with the introduction of preference nodes, the proposed framework is able to explicitly model how users rate items, which can vary dramatically, even among the users with similar tastes on items. empirical study over two datasets of movie ratings has shown that our new algorithm outperforms five other collaborative filtering algorithms substantially. surrounding text:2 model-based approaches in the model-based approaches, training datasets are used to train a predefined model. examples of model-based approaches include clustering models [11, 20, 22], aspect models [8, 9, ***] and latent factor model [3]. [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available influence:1 type:2 pair index:31 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. 
first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:26 citee title:clustering methods for collaborative filtering citee abstract:grouping people into clusters based on the items they have purchased allows accurate recommendations of new items for purchase: if you and i have liked many of the same movies, then i will probably enjoy other movies that you like. recommending items based on similarity of interest (a.k.a. collaborative filtering) is attractive for many domains: books, cds, movies, etc., but does not always work well. because data are always sparse { any given person has seen only a small fraction of all movies { much more accurate predictions can be made by grouping people into clusters with similar movies and grouping movies into clusters which tend to be liked by the same people. finding optimal clusters is tricky because the movie groups should be used to help determine the people groups and visa versa. we present a formal statistical model of collaborative filtering, and compare di erent algorithms for estimating the model parameters including variations of k-means clustering and gibbs sampling. this formal model is easily extended to handle clustering of objects with multiple attributes surrounding text:2 model-based approaches in the model-based approaches, training datasets are used to train a predefined model. examples of model-based approaches include clustering models [11, ***, 22], aspect models [8, 9, 19] and latent factor model [3]. [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available influence:1 type:2 pair index:32 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. 
in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:60 citee title:unifying user-based and item-based collaborative filtering approaches by similarity fusion citee abstract:memory-based methods for collaborative filtering predict new ratings by averaging (weighted) ratings between, respectively, pairs of similar users or items. in practice, a large number of ratings from similar users or similar items are not available, due to the sparsity inherent to rating data. consequently, prediction quality can be poor. this paper reformulates the memory-based collaborative filtering problem in a generative probabilistic framework, treating individual user-item ratings as predictors of missing ratings. the final rating is estimated by fusing predictions from three sources: predictions based on ratings of the same item by other users, predictions based on different item ratings made by the same user, and, third, ratings predicted based on data from other but similar users rating other but similar items. existing user-based and item-based approaches correspond to the two simple cases of our framework. the complete model is however more robust to data sparsity, because the different types of ratings are used in concert, while additional ratings from similar users towards similar items are employed as a background model to smooth the predictions. experiments demonstrate that the proposed methods are indeed more robust against data sparsity and give better recommendations. surrounding text:many recent algorithms have been proposed to alleviate the data sparsity problem. in [***], wang et al$ proposed a generative probabilistic framework to exploit more of the data available in the user-item matrix by fusing all ratings with a predictive value for a recommendation to be made. xue et al - given5 figure 2: mae comparison of emdp and pemd (a smallermae value means a better performance). next, in order to compare our approach with other stateofthe-arts algorithms, we follow the exact evaluation procedures which were described in [***, 22] by extracting a subset of 500 users with more than 40 ratings. table 4 summarizes our experimental results - table 4 summarizes our experimental results. we compare with the following algorithms: similarity fusion (sf) [***], smoothing and cluster-based pcc (scbpcc) [22], the aspect model (am) [9], personality diagnosis (pd) [14] and the user-based pcc [2]. our method outperforms all other competitive algorithms in various configurations influence:1 type:2 pair index:33 citer id:53 citer title:effective missing data prediction for collaborative filtering citer abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. 
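The abstract repeated in these records describes setting separate similarity thresholds for users and items, deciding whether to predict a missing entry at all, and combining user and item information when a prediction is made. The sketch below is only a hedged reading of that description; the parameter names (eta, theta, lam), the mean-centered user part, and the weighted-average item part are assumptions rather than the paper's actual formulas:

```python
import numpy as np

def predict_missing(R, a, i, user_sim, item_sim, eta=0.5, theta=0.5, lam=0.5):
    """Hedged sketch of threshold-gated missing-data prediction.

    user_sim(R, a, u) and item_sim(R, i, j) are similarity functions; eta and
    theta are the user/item similarity thresholds, and lam weights the two
    sources when both are available. Returns None if no neighbour passes a
    threshold, i.e. the entry is left missing.
    """
    users = [(user_sim(R, a, u), u) for u in range(R.shape[0])
             if u != a and R[u, i] > 0]
    users = [(s, u) for s, u in users if s > eta]
    items = [(item_sim(R, i, j), j) for j in range(R.shape[1])
             if j != i and R[a, j] > 0]
    items = [(s, j) for s, j in items if s > theta]

    if not users and not items:
        return None

    ra_mean = R[a][R[a] > 0].mean()
    user_part = item_part = 0.0
    if users:
        num = sum(s * (R[u, i] - R[u][R[u] > 0].mean()) for s, u in users)
        user_part = ra_mean + num / sum(s for s, _ in users)
    if items:
        num = sum(s * R[a, j] for s, j in items)
        item_part = num / sum(s for s, _ in items)

    if users and items:
        return lam * user_part + (1 - lam) * item_part
    return user_part if users else item_part

if __name__ == "__main__":
    R = np.array([[5, 3, 0, 1],
                  [4, 2, 4, 1],
                  [1, 1, 5, 5]], dtype=float)
    # trivial stand-in similarities, just to exercise the function
    usim = lambda R, a, u: 1.0
    isim = lambda R, i, j: 1.0
    print(predict_missing(R, 0, 2, usim, isim))
```

It can be driven by any pair of similarity functions, for example the pcc_user_similarity and item_pcc_similarity sketches shown earlier.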
this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity citee id:61 citee title:scalable collaborative filtering using cluster-based smoothing citee abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms surrounding text:the research of collaborative filtering started from memory-based approaches which utilize the entire user-item database to generate a prediction based on user or item similarity. two types of memory-based methods have been studied: user-based [2, 7, 10, ***] and item-based [5, 12, 17]. user-based methods first look for some similar users who have similar rating styles with the active user and then employ the ratings from those similar users to predict the ratings for the active user - xue et al. [***] proposed a framework for collaborative filtering which combines the strengths of memorybased approaches and model-based approaches by introducing a smoothing-based method, and solved the data sparsity problem by predicting all the missing data in a user-item matrix. although the simulation showed that this approach can achieve better performance than other collaborative filtering algorithms, the cluster-based smoothing algorithm limited the diversity of users in each cluster and predicting all the missing data in the user-item matrix could bring negative influence for the recommendation of active users - 1 memory-based approaches the memory-based approaches are the most popular prediction methods and are widely adopted in commercial collaborative filtering systems [12, 16]. the most analyzed examples of memory-based collaborative filtering include userbased approaches [2, 7, 10, ***] and item-based approaches [5, 12, 17]. 
user-based approaches predict the ratings of active users based on the ratings of similar users found, and itembased approaches predict the ratings of active users based on the information of similar items computed - 2 model-based approaches in the model-based approaches, training datasets are used to train a predefined model. examples of model-based approaches include clustering models [11, 20, ***], aspect models [8, 9, 19] and latent factor model [3]. [11] presented an algorithm for collaborative filtering based on hierarchical clustering, which tried to balance robustness and accuracy of predictions, especially when little data were available - authors in [8] proposed an algorithm based on a generalization of probabilistic latent semantic analysis to continuousvalued response variables. the model-based approaches are often time-consuming to build and update, and cannot cover as diverse a user range as the memory-based approaches do [***]. 2 - 2. 3 other related approaches in order to take the advantages of memory-based and model-based approaches, hybrid collaborative filtering methods have been studied recently [14, ***]. [1, 4] unified collaborative filtering and content-based filtering, which achieved significant improvements over the standard approaches - some work applies data smoothing methods to fill the missing values of the user-item matrix. in [***], xue et al. proposed a clusterbased smoothing method which clusters the users using kmeans first, and then predicts all the missing data based on the ratings of top-n most similar users in the similar clusters - given5 figure 2: mae comparison of emdp and pemd (a smallermae value means a better performance). next, in order to compare our approach with other stateofthe-arts algorithms, we follow the exact evaluation procedures which were described in [21, ***] by extracting a subset of 500 users with more than 40 ratings. table 4 summarizes our experimental results - table 4 summarizes our experimental results. we compare with the following algorithms: similarity fusion (sf) [21], smoothing and cluster-based pcc (scbpcc) [***], the aspect model (am) [9], personality diagnosis (pd) [14] and the user-based pcc [2]. our method outperforms all other competitive algorithms in various configurations influence:1 type:2 pair index:34 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. 
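The surrounding text above describes a cluster-based smoothing method that first clusters users with k-means and then fills missing ratings from similar users in the resulting clusters. The sketch below keeps only the general shape of that idea and simplifies the fill step to a per-cluster item mean, so it should be read as an assumption-laden illustration, not the cited algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_smooth(R, n_clusters=3):
    """Cluster users with k-means, then fill each user's missing ratings from
    the users in the same cluster (here: the cluster's per-item mean, falling
    back to the user's own mean). The cited method instead uses the top-N most
    similar users in similar clusters; this simplification only shows the
    smoothing idea.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(R)
    S = R.astype(float).copy()
    for c in range(n_clusters):
        members = R[labels == c]
        for j in range(R.shape[1]):
            observed = members[:, j][members[:, j] > 0]
            fill = observed.mean() if observed.size else 0.0
            for u in np.where(labels == c)[0]:
                if S[u, j] == 0:
                    own = R[u][R[u] > 0]
                    S[u, j] = fill if fill > 0 else (own.mean() if own.size else 0.0)
    return S

R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 0, 5],
              [2, 0, 1, 4]], dtype=float)
print(cluster_smooth(R, n_clusters=2))
```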
item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:84 citee title:horting hatches an egg: a new graph-theoretic approach to collaborative filtering citee abstract:this paper introduces a new and novel approach to rating-based collaborative filtering. the new technique is most appropriate for e-commerce merchants offering one or more groups of relatively homogeneous items such as compact disks, videos, books, software and the like. in contrast with other known collaborative filtering techniques, the new algorithm is graph-theoretic, based on the twin new concepts of horting and predictability. as is demonstrated in this paper, the technique is fast, scalable, accurate, and requires only a modest learning curve. it makes use of a hierarchical classification scheme in order to introduce context into the rating process, and uses so-called creative links in order to find surprising and atypical items to recommend, perhaps even items which cross the group boundaries. the new technique is one of the key engines of the intelligent recommendation algorithm (ira) project, now being developed at ibm research. in addition to several other recommendation engines, ira contains a situation analyzer to determine the most appropriate mix of engines for a particular e-commerce merchant, as well as an engine for optimizing the placement of advertisements. surrounding text:while dividing the population into clusters may hurt the accuracy of recommendations to users near the fringes of their assigned cluster, pre-clustering may be a worthwhile trade-off between accuracy and throughput. horting is a graph-based technique in which nodes are users, and edges between nodes indicate degree of similarity between two users [***]. predictions are produced by walking the graph to nearby nodes and combining the opinions of the nearby users - horting differs from nearest neighbor as the graph may be walked through other users who have not rated the item in question, thus exploring transitive relationships that nearest neighbor algorithms do not consider. in one study using synthetic data, horting produced better predictions than a nearest neighbor algorithm [***]. schafer et al - 7. references [***] aggarwal, c. c influence:1 type:2 pair index:35 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web.
the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between di erent items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze di erent item-based recommendation generation algorithms. we look into di erent techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and di erent techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available userbased algorithms citee id:97 citee title:recommendation as classification: using social and content-based information in recommendation citee abstract:recommendation systems make suggestions about artifacts to a user. for instance, they may predict whether a user would be interested in seeing a particular movie. social recomendation methods collect ratings of artifacts from many individuals and use nearest-neighbor techniques to make recommendations to a user concerning new artifacts. however, these methods do not use the significant amount of other information that is often available about the nature of each artifact --- such as cast lists surrounding text:the bayesian network model [6] formulates a probabilistic model for collaborative filtering problem. clustering model treats collaborative filtering as a classification problem [***, 6, 29] and works by clustering similar users in same class and estimating the probability that a particular user is in a particular class c, and from there computes the conditional probability of ratings. the rule-based approach applies association rule discovery algorithms to find association between co-purchased items and then generates item recommendation based on the strength of the association between items [25] influence:1 type:2 pair index:36 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. 
these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between di erent items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze di erent item-based recommendation generation algorithms. we look into di erent techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and di erent techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available userbased algorithms citee id:99 citee title:learning collaborative information filters citee abstract:predicting items a user would like on the basis of other users ratings for these items has become a well-established strategy adopted by many recommendation services on the internet. although this can be seen as a classification problem, algorithms proposed thus far do not draw on results from the machine learning literature. we propose a representation for collaborative filtering tasks that allows the application of virtually any machine learning algorithm. we identify the shortcomings of current collaborative filtering techniques and propose the use of learning algorithms paired with feature extraction techniques that specifically address the limitations of previous approaches. our best-performing algorithm is based on the singular value decomposition of an initial matrix of user ratings, exploiting latent structure that essentially eliminates the need for users to rate common items in order to become predictors for one another's preferences. we evaluate the proposed algorithm on a large database of user ratings for motion pictures and find that our approach significantly outperforms current collaborative filtering algorithms. surrounding text:sparsity problem in recommender system has been addressed in [23, 11]. the problems associated with high dimensionality in recommender systems have been discussed in [***], and application of dimensionality reduction techniques to address these issues has been investigated in [24]. our work explores the extent to which item-based recommenders, a new class of recommender algorithms, are able to solve these problems influence:1 type:2 pair index:37 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. 
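The cited abstract above applies singular value decomposition to the rating matrix, and the surrounding text points to dimensionality reduction as one response to sparsity and high dimensionality. Below is a minimal sketch in that spirit; filling missing entries with the user mean and the choice of k are illustrative assumptions, not the cited recipe:

```python
import numpy as np

def low_rank_scores(R, k=2):
    """SVD-based dimensionality reduction for collaborative filtering:
    factor the (mean-filled) rating matrix, keep the top-k singular
    directions, and use the rank-k reconstruction as predicted scores.
    """
    filled = R.astype(float).copy()
    row_means = np.array([row[row > 0].mean() if (row > 0).any() else 0.0 for row in R])
    for u in range(R.shape[0]):
        filled[u, R[u] == 0] = row_means[u]           # fill missing with user mean
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]           # rank-k reconstruction
    return approx

R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 0, 5]], dtype=float)
print(np.round(low_rank_scores(R, k=2), 2))
```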
the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between di erent items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze di erent item-based recommendation generation algorithms. we look into di erent techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and di erent techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available userbased algorithms citee id:35 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation metrics. the first characterizes accuracy over a set of individual predictions in terms of average absolute deviation. the second estimates the utility of a ranked list of suggested items. this metric uses an estimate of the probability that a user will see a recommendation in an ordered list. experiments were run for datasets associated with 3 application areas, 4 experimental protocols, and the 2 evaluation metrics for the various algorithms. results indicate that for a wide range of conditions, bayesian networks with decision trees at each node and correlation methods outperform bayesian-clustering and vector-similarity methods. between correlation and bayesian networks, the preferred method depends on the nature of the dataset, nature of the application (ranked versus one-by-one presentation), and the availability of votes with which to make predictions. other considerations include the size of database, speed of predictions, and learning time. surrounding text:the model can be built o -line over a matter of hours or days. the resulting model is very small, very fast, and essentially as accurate as nearest neighbor methods [***]. 
bayesian networks may prove practical for environments in which knowledge of user preferences changes slowly with respect to the time needed to build the model, but are not suitable for environments in which user preference models must be updated rapidly or frequently - the prediction is then an average across the clusters, weighted by degree of participation. clustering techniques usually produce less-personal recommendations than other methods, and in some cases, the clusters have worse accuracy than nearest neighbor algorithms [***]. once the clustering is complete, however, performance can be very good, since the size of the group that must be analyzed is much smaller - each individual rating is within a numerical scale, and it can also be 0, indicating that the user has not yet rated that item. researchers have devised a number of collaborative filtering algorithms that can be divided into two main categories: memory-based (user-based) and model-based (item-based) algorithms [***]. in this section we provide a detailed analysis of cf-based recommender system algorithms - the model building process is performed by different machine learning algorithms such as bayesian network, clustering, and rule-based approaches. the bayesian network model [***] formulates a probabilistic model for the collaborative filtering problem. the clustering model treats collaborative filtering as a classification problem [2, ***, 29] and works by clustering similar users into the same class and estimating the probability that a particular user is in a particular class c, and from there computes the conditional probability of ratings - [figure 1: the collaborative filtering process; its output interface produces a top-n list of items {t_i1, ..., t_in} for the active user.] schemes have been proposed to compute the association between items ranging from a probabilistic approach [***] to more traditional item-item correlations [15, 13]. we present a detailed analysis of our approach in the next section influence:1 type:2 pair index:38 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system.
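The citer abstract names weighted sum and regression as the two ways of turning item-item similarities into recommendations. Below is a minimal sketch of the weighted-sum variant with plain cosine item similarity; the neighborhood size k, the similarity choice, and the toy data are assumptions:

```python
import numpy as np

def cosine_item_similarity(R, i, j):
    """Plain cosine similarity between two item columns over co-rating users."""
    co = (R[:, i] > 0) & (R[:, j] > 0)
    if not co.any():
        return 0.0
    ri, rj = R[co, i], R[co, j]
    den = np.linalg.norm(ri) * np.linalg.norm(rj)
    return float(np.dot(ri, rj) / den) if den else 0.0

def weighted_sum_prediction(R, a, i, k=2):
    """Weighted-sum item-based prediction: score item i for user a from the
    user's own ratings on the k most similar items the user has rated.
    """
    rated = [j for j in range(R.shape[1]) if j != i and R[a, j] > 0]
    sims = sorted(((cosine_item_similarity(R, i, j), j) for j in rated), reverse=True)[:k]
    num = sum(s * R[a, j] for s, j in sims if s > 0)
    den = sum(abs(s) for s, j in sims if s > 0)
    return num / den if den else None

R = np.array([[5, 3, 4, 0],
              [4, 2, 5, 1],
              [1, 0, 2, 5],
              [4, 3, 4, 0]], dtype=float)
print(weighted_sum_prediction(R, 0, 3))
```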
new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:27 citee title:using collaborative filtering to weave an information tapestry citee abstract:the tapestry experimental mail system developed at the xerox palo alto research center is predicated on the belief that information filtering can be more effective when humans are involved in the filtering process. tapestry was designed to support both content-based filtering and collaborative filtering, which entails people collaborating to help each other perform filtering by recording their reactions to documents they read. the reactions are called annotations; they can be accessed by other people's filters. tapestry is intended to handle any incoming stream of electronic documents and serves both as a mail filter and repository; its components are the indexer, document store, annotation store, filterer, little box, remailer, appraiser and reader/browser. tapestry's client/server architecture, its various components, and the tapestry query language are described. surrounding text:1 related work in this section we briefly present some of the research literature related to collaborative filtering, recommender systems, data mining and personalization. tapestry [***] is one of the earliest implementations of collaborative filtering-based recommender systems. this system relied on the explicit opinions of people from a close-knit community, such as an office workgroup influence:1 type:2 pair index:39 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems.
to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:43 citee title:combining collaborative filtering with personal agents for better recommendations citee abstract:information filtering agents and collaborative filtering both attempt to alleviate information overload by identifying which items a user will find worthwhile. information filtering (if) focuses on the analysis of item content and the development of a personal user interest profile. collaborative filtering (cf) focuses on identification of other users with similar tastes and the use of their opinions to recommend items. each technique has advantages and limitations that suggest that the two could be beneficially combined. this paper shows that a cf framework can be used to combine personal if agents and the opinions of a community of users to produce better recommendations than either agents or users can produce alone. it also shows that using cf to create a personal combination of a set of agents produces better results than either individual agents or other combination mechanisms. one key implication of these results is that users can avoid having to select among agents; they can use them all and let the cf framework select the best ones for them. surrounding text:although these systems have been successful in the past, their widespread use has exposed some of their limitations such as the problems of sparsity in the data set, problems associated with high dimensionality and so on. the sparsity problem in recommender systems has been addressed in [23, ***]. the problems associated with high dimensionality in recommender systems have been discussed in [4], and the application of dimensionality reduction techniques to address these issues has been investigated in [24] - the weakness of the nearest neighbor algorithm for large, sparse databases led us to explore alternative recommender system algorithms. our first approach attempted to bridge the sparsity by incorporating semi-intelligent filtering agents into the system [23, ***]. these agents evaluated and rated each item using syntactic features influence:1 type:2 pair index:40 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web.
the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:75 citee title:evaluation of item-based top-n recommendation algorithms citee abstract:the explosive growth of the world-wide-web and the emergence of e-commerce has led to the development of recommender systems---a personalized information filtering technology used to identify a set of n items that will be of interest to a certain user. user-based collaborative filtering is the most successful technology for building recommender systems to date, and is extensively used in many commercial recommender systems. unfortunately, the computational complexity of these methods grows surrounding text:figure 1: the collaborative filtering process. schemes have been proposed to compute the association between items ranging from a probabilistic approach [6] to more traditional item-item correlations [***, 13]. we present a detailed analysis of our approach in the next section influence:1 type:2 pair index:41 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems.
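the top-n list referred to in the figure caption above can be generated directly from item-item associations. a minimal sketch under an assumed precomputed similarity table; names and numbers are illustrative only.

# score each unseen item by summing its similarity to the user's items,
# then keep the n highest-scoring candidates.
item_sims = {
    "i1": {"i2": 0.9, "i3": 0.2, "i4": 0.4},
    "i2": {"i1": 0.9, "i4": 0.7},
    "i3": {"i1": 0.2, "i4": 0.1},
}

def top_n(user_items, n=2):
    scores = {}
    for seen in user_items:
        for cand, s in item_sims.get(seen, {}).items():
            if cand not in user_items:
                scores[cand] = scores.get(cand, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(top_n({"i1", "i3"}))   # -> ['i2', 'i4']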
to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:28 citee title:grouplens: applying collaborative filtering to usenet news citee abstract:the grouplens project designed, implemented, and evaluated a collaborative filtering system for usenet news, a high-volume, high-turnover discussion list service on the internet. usenet newsgroups, the individual discussion lists, may carry hundreds of messages each day. while in theory the newsgroup organization allows readers to select the content that most interests them, in practice surrounding text:through all the available information to find that which is most valuable to us. one of the most promising such technologies is collaborative filtering [19, 27, 14, ***]. collaborative filtering works by building a database of preferences for items by users - later, several ratings-based automated recommender systems were developed. the grouplens research system [19, ***] provides a pseudonymous collaborative filtering solution for usenet news and movies. ringo [27] and video recommender [14] are email and web-based systems that generate recommendations on music and movies, respectively - each user u_i has a list of items I_u_i, which the user has expressed his/her opinions about. opinions can be explicitly given by the user as a rating score, generally within a certain numerical scale, or can be implicitly derived from purchase records, by analyzing timing logs, by mining web hyperlinks and so on [28, ***]. note that I_u_i is a subset of the full item set I, and it is possible for I_u_i to be a null set influence:1 type:2 pair index:42 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques.
item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:33 citee title:grouplens: an open architecture for collaborative filtering of netnews citee abstract:collaborative filters help people make choices based on the opinions of other people. grouplens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. news reader clients display predicted scores and make it easy for users to rate articles after they read them. rating servers, called better bit bureaus, gather and disseminate the ratings. the rating servers predict scores based on the heuristic that people who surrounding text:through all the available information to find that which is most valuable to us. one of the most promising such technologies is collaborative filtering [***, 27, 14, 16]. collaborative filtering works by building a database of preferences for items by users - later, several ratings-based automated recommender systems were developed. the grouplens research system [***, 16] provides a pseudonymous collaborative filtering solution for usenet news and movies. ringo [27] and video recommender [14] are email and web-based systems that generate recommendations on music and movies, respectively - recommendations can be based on demographics of the users, overall top selling items, or past buying habits of users as a predictor of future items. collaborative filtering (cf) [***, 27] is the most successful recommendation technique to date. the basic idea of cf-based algorithms is to provide item recommendations or predictions based on the opinions of other like-minded users influence:1 type:2 pair index:43 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems.
to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:100 citee title:recommender systems citee abstract:recommender systems assist and augment this natural social process. in a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. in some cases the primary surrounding text:ringo [27] and video recommender [14] are email and web-based systems that generate recommendations on music and movies, respectively. a special issue of communications of the acm [***] presents a number of different recommender systems. other technologies have also been applied to recommender systems, including bayesian networks, clustering, and horting influence:2 type:3 pair index:44 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach.
our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:102 citee title:using filtering agents to improve prediction quality in the grouplens research collaborative filtering system citee abstract:collaborative filtering systems help address information overload by using the opinions of users in a community to make personal recommendations for documents to each user. many collaborative filtering systems have few user opinions relative to the large number of documents available. this sparsity problem can reduce the utility of the filtering system by reducing the number of documents for which the system can make recommendations and adversely affecting the quality of recommendations. this paper defines and implements a model for integrating content-based ratings into a collaborative filtering system. the filterbot model allows collaborative filtering systems to address sparsity by tapping the strength of content filtering techniques. we identify and evaluate metrics for assessing the effectiveness of filterbots specifically, and filtering system enhancements in general. finally, we experimentally validate the filterbot approach by showing that even simple filterbots such as spell checking can increase the utility for users of sparsely populated collaborative filtering systems. surrounding text:although these systems have been successful in the past, their widespread use has exposed some of their limitations such as the problems of sparsity in the data set, problems associated with high dimensionality and so on. the sparsity problem in recommender systems has been addressed in [***, 11]. the problems associated with high dimensionality in recommender systems have been discussed in [4], and the application of dimensionality reduction techniques to address these issues has been investigated in [24] - the weakness of the nearest neighbor algorithm for large, sparse databases led us to explore alternative recommender system algorithms. our first approach attempted to bridge the sparsity by incorporating semi-intelligent filtering agents into the system [***, 11]. these agents evaluated and rated each item using syntactic features - with this observation, whether an item has a prediction score of 1.5 or 2.5 on a five-point scale is irrelevant if the user only chooses to consider predictions of 4 or higher. the most commonly used decision support accuracy metrics are reversal rate, weighted errors and roc sensitivity [***]. we used mae as our choice of evaluation metric to report prediction experiments because it is the most commonly used and easiest to interpret directly - we used mae as our choice of evaluation metric to report prediction experiments because it is the most commonly used and easiest to interpret directly. in our previous experiments [***] we have seen that mae and roc provide the same ordering of different experimental schemes in terms of prediction quality influence:1 type:2 pair index:45 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web.
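the two evaluation views discussed above (statistical accuracy vs. decision support) can be contrasted in a few lines. this is an assumed toy example: the (predicted, actual) pairs and the relevance cutoff of 4 are invented for illustration, and the sensitivity below is only one simple roc-style number, not the full roc analysis of the cited work.

# mae treats every error equally; the decision-support view only asks whether
# items the user would consider (actual >= 4) are predicted above the cutoff.
pairs = [(4.2, 5), (3.1, 2), (1.5, 1), (4.8, 4), (2.5, 4)]  # (predicted, actual)

mae = sum(abs(p - a) for p, a in pairs) / len(pairs)

cutoff = 4
relevant = [(p, a) for p, a in pairs if a >= cutoff]
hits = [(p, a) for p, a in relevant if p >= cutoff]
sensitivity = len(hits) / len(relevant) if relevant else 0.0

print(f"mae={mae:.2f} sensitivity@{cutoff}={sensitivity:.2f}")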
the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:103 citee title:recommender systems in e-commerce citee abstract:recommender systems are changing from novelties used by a few e-commerce sites, to serious business tools that are re-shaping the world of e-commerce. many of the largest commerce web sites are already using recommender systems to help their customers find products to purchase. a recommender system learns from a customer and recommends products that she will find most valuable from among the available products. in this paper we present an explanation of how recommender systems help e-commerce sites increase sales, and analyze six sites that use recommender systems including several sites that use more than one recommender system. based on the examples, we create a taxonomy of recommender systems, including the interfaces they present to customers, the technologies used to create the recommendations, and the inputs they need from customers. we conclude with ideas for new applications of recommender systems to e-commerce. surrounding text:schafer et al. [***] present a detailed taxonomy and examples of recommender systems used in e-commerce and how they can provide one-to-one personalization and at the same time can capture customer loyalty. although these systems have been successful in the past, their widespread use has exposed some of their limitations such as the problems of sparsity in the data set, problems associated with high dimensionality and so on influence:2 type:3,2 pair index:46 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web.
the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:104 citee title:phoaks: a system for sharing recommendations citee abstract:finding relevant, high-quality information on the world-wide web is a difficult problem. phoaks (people helping one another know stuff) is an experimental system that addresses this problem through a collaborative filtering approach. phoaks works by automatically recognizing, tallying, and redistributing recommendations of web resources mined from usenet news messages. surrounding text:each user u_i has a list of items I_u_i, which the user has expressed his/her opinions about. opinions can be explicitly given by the user as a rating score, generally within a certain numerical scale, or can be implicitly derived from purchase records, by analyzing timing logs, by mining web hyperlinks and so on [***, 16]. note that I_u_i is a subset of the full item set I, and it is possible for I_u_i to be a null set influence:1 type:2 pair index:47 citer id:58 citer title:item-based collaborative filtering recommendation algorithms citer abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques.
item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms citee id:26 citee title:clustering methods for collaborative filtering citee abstract:grouping people into clusters based on the items they have purchased allows accurate recommendations of new items for purchase: if you and i have liked many of the same movies, then i will probably enjoy other movies that you like. recommending items based on similarity of interest (a.k.a. collaborative filtering) is attractive for many domains: books, cds, movies, etc., but does not always work well. because data are always sparse (any given person has seen only a small fraction of all movies), much more accurate predictions can be made by grouping people into clusters with similar movies and grouping movies into clusters which tend to be liked by the same people. finding optimal clusters is tricky because the movie groups should be used to help determine the people groups and vice versa. we present a formal statistical model of collaborative filtering, and compare different algorithms for estimating the model parameters including variations of k-means clustering and gibbs sampling. this formal model is easily extended to handle clustering of objects with multiple attributes surrounding text:the bayesian network model [6] formulates a probabilistic model for the collaborative filtering problem. the clustering model treats collaborative filtering as a classification problem [2, 6, ***] and works by clustering similar users in the same class and estimating the probability that a particular user is in a particular class c, and from there computing the conditional probability of ratings. the rule-based approach applies association rule discovery algorithms to find associations between co-purchased items and then generates item recommendations based on the strength of the association between items [25] influence:1 type:2 pair index:48 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection.
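the k-means style user clustering referred to above (both in the citee abstract and in the cluster-based smoothing paper) can be sketched in a few lines. this is a rough illustration with assumed toy vectors in which missing ratings are simply filled with 0 for the distance computation; it is not the cited papers' implementation.

# one simple k-means loop over user rating vectors.
import random

users = {"u1": [5, 4, 0], "u2": [4, 5, 1], "u3": [1, 0, 5], "u4": [2, 1, 4]}

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k=2, iters=10, seed=0):
    random.seed(seed)
    centroids = random.sample(list(vectors.values()), k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for name, v in vectors.items():
            groups[min(range(k), key=lambda c: dist(v, centroids[c]))].append(name)
        for c, members in enumerate(groups):
            if members:   # recompute each centroid from its members
                centroids[c] = [sum(vectors[m][d] for m in members) / len(members)
                                for d in range(len(centroids[c]))]
    return groups

print(kmeans(users))   # e.g. [['u1', 'u2'], ['u3', 'u4']]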
as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:77 citee title:fab: content-based, collaborative recommendation citee abstract:online readers are in need of tools to help them cope with the mass of content available on the world-wide web. in traditional media, readers are provided assistance in making selections. this includes both implicit assistance in the form of editorial oversight and explicit assistance in the form of recommendation services such as movie reviews and restaurant guides. the electronic medium offers new opportunities to create recommendation services, ones that adapt over time to track their evolving interests. fab is such a recommendation system for the web, and has been operational in several versions since december 1994. surrounding text:by considering the association between users and items, transitive associations of the associative-retrieval technique [11] are proposed to iteratively reinforce the similarity of the users and the similarity of items. content-boosted cf [***][5] approaches require additional information regarding items as well as a metric to compute meaningful similarities among them. in [17], a - umn.edu/research/grouplens/) and eachmovie [***]. for the movielens dataset, we extracted a subset of 500 users with more than 40 ratings - 2 metrics and methodology we use the mean absolute error (mae) [10], a statistical accuracy metric, to measure the prediction quality: $\mathrm{MAE}_u = \frac{1}{|T_u|}\sum_{t_j \in T_u}\left|r_{u,t_j} - \tilde{r}_{u,t_j}\right|$ (9) [***] eachmovie dataset is provided by the compaq systems research center. for more information see http://www - 6. references [***] m. balabanovic, y influence:1 type:2 pair index:49 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:23 citee title:class-based n-gram models of natural language citee abstract:we address the problem of predicting a word from previous words in a sample of text. in particular, we discuss n-gram models based on classes of words. we also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. we find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
surrounding text:to fill the missing values in the data set, we make explicit use of clusters as smoothing mechanisms. the cluster-based smoothing technique for natural language processing [***] successfully estimates the probability of an unseen term by using the topic (cluster) to which the term belongs, which motivates us to examine the sparsity problem in collaborative filtering. based on the clustering results, we apply the smoothing strategies to the unseen rating data influence:3 type:3 pair index:50 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:35 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation metrics. the first characterizes accuracy over a set of individual predictions in terms of average absolute deviation. the second estimates the utility of a ranked list of suggested items. this metric uses an estimate of the probability that a user will see a recommendation in an ordered list. experiments were run for datasets associated with 3 application areas, 4 experimental protocols, and the 2 evaluation metrics for the various algorithms. results indicate that for a wide range of conditions, bayesian networks with decision trees at each node and correlation methods outperform bayesian-clustering and vector-similarity methods. between correlation and bayesian networks, the preferred method depends on the nature of the dataset, nature of the application (ranked versus one-by-one presentation), and the availability of votes with which to make predictions. other considerations include the size of database, speed of predictions, and learning time. surrounding text:memory-based algorithms perform the computation on the entire database to identify the top k most similar users to the active user from the training database in terms of the rating patterns and then combine those ratings together. notable examples include the pearson-correlation based approach [16], the vector similarity based approach [***], and the extended generalized vector-space model [20].
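the smoothing step described above (fill a user's unseen ratings from his or her cluster before doing the memory-based computation) can be sketched as follows. this is a hedged illustration in the spirit of the text, with assumed toy ratings and a hand-made cluster assignment; it is not the paper's exact smoothing formula.

# a missing rating is filled with the user's own mean plus the average
# deviation of the user's cluster on that item; observed ratings are kept.
ratings = {
    "u1": {"i1": 5, "i2": 4}, "u2": {"i1": 4, "i3": 2},
    "u3": {"i1": 1, "i2": 2}, "u4": {"i2": 1, "i3": 5},
}
clusters = {"u1": 0, "u2": 0, "u3": 1, "u4": 1}

def mean(user):
    vals = list(ratings[user].values())
    return sum(vals) / len(vals)

def smoothed(user, item):
    if item in ratings[user]:
        return ratings[user][item]
    mates = [u for u, c in clusters.items() if c == clusters[user] and item in ratings[u]]
    if not mates:
        return mean(user)                         # nothing to borrow from
    dev = sum(ratings[u][item] - mean(u) for u in mates) / len(mates)
    return mean(user) + dev

print(smoothed("u1", "i3"))   # -> 3.5 with these toy numbers

after smoothing, every user has a value for every item, so neighborhood selection and the weighted-average prediction can run on the densified matrix.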
these approaches focused on utilizing the existing ratings of a training user as features; however, the memory-based method suffers from two fundamental problems: data sparsity and inability to scale up - in order to predict the rating from an active user on a particular item, these approaches first categorize the active user into one or more of the predefined user classes and use the rating of the predicted classes on the targeted item as the prediction. algorithms within this category include the bayesian network approach [***], the clustering approach [13][21] and the aspect models [12]. the model-based approaches are often time-consuming to build and update, and cannot cover as diverse a user range as the memory-based approaches do - 1.1 memory-based approaches the memory-based approaches [***] are among the most popular prediction techniques in collaborative filtering. the basic idea is to compute the active user's predicted vote of an item as a weighted average of votes by other similar users or k nearest neighbors (knn) - the basic idea is to compute the active user's predicted vote of an item as a weighted average of votes by other similar users or k nearest neighbors (knn). two commonly used memory-based algorithms are the pearson correlation coefficient (pcc) algorithm [16] and the vector space similarity (vss) algorithm [***]. these two approaches differ in the computation of similarity - these two approaches differ in the computation of similarity. as described in [***], the pcc algorithm generally achieves higher performance than the vector-space similarity method influence:1 type:2 pair index:51 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:44 citee title:combining content-based and collaborative filters in an online newspaper citee abstract:the explosive growth of mailing lists, web sites and usenet news demands effective filtering solutions. collaborative filtering combines the informed opinions of humans to make personalized, accurate predictions. content-based filtering uses the speed of computers to make complete, fast predictions. in this work, we present a new filtering approach that combines the coverage and speed of content-filters with the depth of collaborative filtering. we apply our research approach to an online newspaper, an as yet untapped opportunity for filters useful to the wide-spread news reading populace. we present the design of our filtering system and describe the results from preliminary experiments that suggest merits to our approach.
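the two similarity measures named above (pcc and vss) and the knn weighted-average prediction can be written out side by side. a minimal sketch over assumed toy data; for brevity the weighted average below omits the usual mean-offset correction, so it is illustrative rather than the exact formulation in the cited work.

from math import sqrt

ratings = {
    "a": {"i1": 4, "i2": 3},
    "b": {"i1": 5, "i2": 2, "i3": 5},
    "c": {"i1": 1, "i2": 5, "i3": 2},
}

def common_items(u, v):
    return set(ratings[u]) & set(ratings[v])

def pcc(u, v):
    # pearson correlation over co-rated items
    items = common_items(u, v)
    if len(items) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in items) / len(items)
    mv = sum(ratings[v][i] for i in items) / len(items)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in items)
    du = sqrt(sum((ratings[u][i] - mu) ** 2 for i in items))
    dv = sqrt(sum((ratings[v][i] - mv) ** 2 for i in items))
    return num / (du * dv) if du and dv else 0.0

def vss(u, v):
    # cosine of the two raw rating vectors
    items = common_items(u, v)
    num = sum(ratings[u][i] * ratings[v][i] for i in items)
    du = sqrt(sum(r * r for r in ratings[u].values()))
    dv = sqrt(sum(r * r for r in ratings[v].values()))
    return num / (du * dv) if du and dv else 0.0

def knn_predict(active, item, sim=pcc, k=2):
    neigh = [(sim(active, v), v) for v in ratings if v != active and item in ratings[v]]
    neigh = sorted(neigh, reverse=True)[:k]
    den = sum(abs(s) for s, _ in neigh)
    return sum(s * ratings[v][item] for s, v in neigh) / den if den else None

print(knn_predict("a", "i3"), knn_predict("a", "i3", sim=vss))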
surrounding text:by considering the association between users and items, transitive associations of the associative-retrieval technique [11] are proposed to iteratively reinforce the similarity of the users and the similarity of items. content-boosted cf [1][***] approaches require additional information regarding items as well as a metric to compute meaningful similarities among them. in [17], a influence:1 type:2 pair index:52 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users.
in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:39 citee title:swami: a framework for collaborative filtering algorithm development and evaluation citee abstract:we present a java-based framework, swami (shared wisdom through the amalgamation of many interpretations) for building and studying collaborative filtering systems. swami consists of three components: a prediction engine, an evaluation system, and a visualization component. the prediction engine provides a common interface for implementing different prediction algorithms. the evaluation system provides a standardized testing methodology and metrics for analyzing the accuracy and run-time performance of prediction algorithms. the visualization component suggests how graphical representations can inform the development and analysis of prediction algorithms. we demonstrate swami on the eachmovie data set by comparing three prediction algorithms: a traditional pearson correlation-based method, support vector machines, and a new accurate and scalable correlation-based method based on clustering techniques. surrounding text:a simple strategy is to form clusters of users or items and then use these clusters as basic units in making recommendations. principal component analysis (pca) [8] and information retrieval techniques such as latent semantic indexing (lsi) [***][18] are also proposed. zeng [23] proposed to compute the users' similarity by a matrix conversion method for similarity measure influence:1 type:2 pair index:53 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:67 citee title:eigentaste: a constant time collaborative filtering algorithm citee abstract:eigentaste is a collaborative filtering algorithm that uses universal queries to elicit real-valued user ratings on a common set of items and applies principal component analysis (pca) to the resulting dense subset of the ratings matrix. pca facilitates dimensionality reduction for offline clustering of users and rapid computation of recommendations. for a database of n users, standard nearest-neighbor techniques require o(n) processing time to compute recommendations, whereas eigentaste requires o(1) (constant) time. we compare eigentaste to alternative algorithms using data from jester, an online joke recommending system. jester has collected approximately 2,500,000 ratings from 57,000 users. we use the normalized mean absolute error (nmae) measure to compare performance of different algorithms. in the appendix we use uniform and normal distribution models to derive analytic estimates of nmae when predictions are random. on the jester dataset, eigentaste computes recommendations two orders of magnitude faster with no loss of accuracy. jester is online at: http://eigentaste.berkeley.edu surrounding text:a simple strategy is to form clusters of users or items and then use these clusters as basic units in making recommendations. principal component analysis (pca) [***] and information retrieval techniques such as latent semantic indexing (lsi) [7][18] are also proposed. zeng [23] proposed to compute the users' similarity by a matrix conversion method for similarity measure influence:1 type:2 pair index:54 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:74 citee title:evaluating collaborative filtering recommender systems citee abstract:recommender systems have been evaluated in many, often incomparable, ways.
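the pca/lsi style dimensionality reduction mentioned in the surrounding text can be sketched with a rank-k svd of a mean-filled rating matrix; the compact user factors can then be clustered or compared instead of the raw sparse rows. this is an assumed toy example using numpy, not the cited systems' implementations.

import numpy as np

R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

# treat 0 as "missing" and fill with the column mean of observed ratings
filled = R.copy()
for j in range(R.shape[1]):
    observed = R[:, j][R[:, j] > 0]
    filled[:, j][R[:, j] == 0] = observed.mean()

U, s, Vt = np.linalg.svd(filled - filled.mean(axis=0), full_matrices=False)
k = 2
user_factors = U[:, :k] * s[:k]   # low-dimensional user representation
print(np.round(user_factors, 2))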
in this article, we review the key decisions in evaluating collaborative filtering recommender systems: the user tasks being evaluated, the types of analysis and datasets being used, the ways in which prediction quality is measured, the evaluation of prediction attributes other than quality, and the user-based evaluation of the system as a whole. in addition to reviewing the evaluation strategies used by prior researchers, we present empirical results from the analysis of various accuracy metrics on one content domain where all the tested metrics collapsed roughly into three equivalence classes. metrics within each equivalency class were strongly correlated, while metrics from different equivalency classes were uncorrelated. surrounding text:characteristics of movierating and eachmovie 4.2 metrics and methodology we use the mean absolute error (mae) [***], a statistical accuracy metric, to measure the prediction quality: $\mathrm{MAE}_u = \frac{1}{|T_u|}\sum_{t_j \in T_u}\left|r_{u,t_j} - \tilde{r}_{u,t_j}\right|$ (9) [1] eachmovie dataset is provided by the compaq systems research center influence:2 type:3,2 pair index:55 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:12 citee title:applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering citee abstract:recommender systems are being widely applied in many application settings to suggest products, services, and information items to potential consumers. collaborative filtering, the most successful recommendation approach, makes recommendations based on past transactions and feedback from consumers sharing similar interests. a major problem limiting the usefulness of collaborative filtering is the sparsity problem, which refers to a situation in which transactional or feedback data is sparse and insufficient to identify similarities in consumer interests. in this article, we propose to deal with this sparsity problem by applying an associative retrieval framework and related spreading activation algorithms to explore transitive associations among consumers through their past transactions and feedback. such transitive associations are a valuable source of information to help infer consumer interests and can be explored to deal with the sparsity problem. to evaluate the effectiveness of our approach, we have conducted an experimental study using a data set from an online bookstore.
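read literally, the reconstructed equation (9) above is just a per-user mean absolute error over that user's test items t_u; a few lines make the reading explicit (the rating values are assumed toy numbers).

def mae_u(true_ratings, predicted_ratings):
    # mean absolute error over one user's test items
    assert true_ratings and len(true_ratings) == len(predicted_ratings)
    return sum(abs(t - p) for t, p in zip(true_ratings, predicted_ratings)) / len(true_ratings)

print(mae_u([4, 2, 5], [3.5, 2.5, 4.0]))   # -> 0.666...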
we experimented with three spreading activation algorithms including a constrained leaky capacitor algorithm, a branch-and-bound serial symbolic search algorithm, and a hopfield net parallel relaxation search algorithm. these algorithms were compared with several collaborative filtering approaches that do not consider the transitive associations: a simple graph search approach, two variations of the user-based approach, and an item-based approach. our experimental results indicate that spreading activation-based approaches significantly outperformed the other collaborative filtering methods as measured by recommendation precision, recall, the f-measure, and the rank score. we also observed the over-activation effect of the spreading activation approach, that is, incorporating transitive associations with past transactional data that is not sparse may "dilute" the data used to infer user preferences and lead to degradation in recommendation performance surrounding text:however, potentially useful information might be lost during this reduction process. by considering the association between users and items, transitive associations of the associative-retrieval technique [***] are proposed to iteratively reinforce the similarity of the users and the similarity of items. content-boosted cf [1][5] approaches require additional information regarding items as well as a metric to compute meaningful similarities among them influence:1 type:2 pair index:56 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tend to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:36 citee title:latent class models for collaborative filtering citee abstract:this paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior. two models are discussed and compared: the aspect model, a probabilistic latent space model which models individual preferences as a convex combination of preference factors, and the two-sided clustering model, which simultaneously partitions persons and objects into clusters. we present em algorithms for different variants of the aspect model and derive an approximate em algorithm based on a variational principle for the two-sided clustering model. the benefits of the different models are experimentally investigated on a large movie data set.
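the transitive-association idea described above can be illustrated with a very small spreading sketch on the user-item graph: weight flows from a user's items to co-rating users and back to their items, so an item two hops away gets a score even with no direct overlap. this is only a toy illustration; the cited work uses proper spreading activation algorithms (leaky capacitor, branch-and-bound, hopfield net), not this simplification.

rated_by = {"i1": {"u1", "u2"}, "i2": {"u2", "u3"}, "i3": {"u3"}}
rates    = {"u1": {"i1"}, "u2": {"i1", "i2"}, "u3": {"i2", "i3"}}

def spread(user, decay=0.5):
    scores = {}
    for item in rates[user]:                  # user -> rated items
        for other in rated_by[item]:          # items -> co-rating users
            if other == user:
                continue
            for cand in rates[other]:         # users -> their other items
                if cand not in rates[user]:
                    scores[cand] = scores.get(cand, 0.0) + decay
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(spread("u1"))   # i2 becomes reachable through u2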
surrounding text:in order to predict the rating from an active user on a particular item, these approaches first categorize the active user into one or more of the predefined user classes and use the rating of the predicted classes on the targeted item as the prediction. algorithms within this category include bayesian network approach [4], clustering approach [13][21] and the aspect models [***]. the model-based approaches are often time-consuming to build and update, and cannot cover as diverse a user range as the memory-based approaches do - 1. 2 model-based approaches two popular model-based algorithms are the clustering for collaborative filtering [13][21] and the aspect models [***]. clustering techniques work by identifying groups of users who appear to have similar preferences - the prediction is then an average across the clusters, weighted by the degree of participation. the aspect model [***] is a probabilistic latent-space model, which considers individual preferences as a convex combination of preference factors. the latent class variable is associated with each observation pair of a user and an item influence:1 type:2 pair index:57 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:24 citee title:clustering for collaborative filtering applications citee abstract:collaborative filtering systems assist users to identify items of interest by providing predictions based on ratings of other users. the quality of the predictions depends strongly on the amount of available ratings and collaborative filtering algorithms perform poorly when only few ratings are available. in this paper we identify two important situations with sparse ratings: bootstrapping a collaborative filtering system with few users and providing recommendations for new users, who rated surrounding text:in order to predict the rating from an active user on a particular item, these approaches first categorize the active user into one or more of the predefined user classes and use the rating of the predicted classes on the targeted item as the prediction. algorithms within this category include bayesian network approach [4], clustering approach [***][21] and the aspect models [12]. the model-based approaches are often time-consuming to build and update, and cannot cover as diverse a user range as the memory-based approaches do - 1. 2 model-based approaches two popular model-based algorithms are the clustering for collaborative filtering [***][21] and the aspect models [12]. 
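a minimal sketch of the clustering idea referred to above: users are grouped by their rating vectors and a prediction is an average over cluster centroids weighted by the active user's degree of participation in each cluster. the hand-picked centroids and the distance-based weighting are assumptions of this sketch, not the exact models of the cited papers.

import numpy as np

def cluster_predict(R, active, item, centroids):
    """predict a rating as a participation-weighted average of cluster centroids."""
    # distance of the active user's rating vector to each cluster centroid
    d = np.linalg.norm(centroids - R[active], axis=1)
    w = np.exp(-d)                      # closer clusters participate more
    w = w / w.sum()
    return float(w @ centroids[:, item])

# toy rating matrix (0 = unrated) and two hand-picked "centroids"
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4]], dtype=float)
centroids = np.array([[4.5, 4.5, 1.0, 1.0],
                      [1.0, 1.0, 4.5, 4.0]])
print(cluster_predict(R, active=0, item=2))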
clustering techniques work by identifying groups of users who appear to have similar preferences influence:1 type:2 pair index:58 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:31 citee title:collaborative filtering by personality diagnosis: a hybrid memory- and model-based approach citee abstract:the growth of internet commerce has stimulated the use of collaborative filtering (cf) algorithms as recommender systems. such systems leverage knowledge about the known preferences of multiple users to recommend items of interest to other users. cf methods have been harnessed to make recommendations about such items as web pages, movies, books, and toys. researchers have proposed and evaluated many approaches for generating recommendations. we describe and evaluate a new method called personality diagnosis (pd). given a users preferences for some items, we compute the probability that he or she is of the same personality type as other users, and, in turn, the probability that he or she will like new items. pd retains some of the advantages of traditional similarity-weighting techniques in that all data is brought to bear on each prediction and new data can be added easily and incrementally. additionally, pd has a meaningful probabilistic interpretation, which may be leveraged to justify, explain, and augment results. we report empirical results on the eachmovie database of movie ratings, and on user profile data collected from the citeseer digital library of computer science research papers. the probabilistic framework naturally supports a variety of descriptive measurementsin particular, we consider the applicability of a value of information (voi) computation. surrounding text:1. 3 hybrid model pennock et al$ [***] proposed a hybrid memory- and model-based approach. given a users preferences for some items, they compute the probability that a user belongs to the same personality diagnosis by assigning the missing rating as a uniform distribution over all possible ratings [***] - 3 hybrid model pennock et al$ [***] proposed a hybrid memory- and model-based approach. given a users preferences for some items, they compute the probability that a user belongs to the same personality diagnosis by assigning the missing rating as a uniform distribution over all possible ratings [***]. 
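the personality diagnosis (pd) method described in the record above can be sketched as follows: the probability that the active user shares another user's underlying rating vector is modeled as a product of gaussian terms over commonly rated items, and every user then votes for its own rating of the target item with that probability. the gaussian width sigma and the 1-5 rating scale are assumptions for illustration.

import numpy as np

def pd_predict(R, active, item, sigma=1.0, scale=(1, 2, 3, 4, 5)):
    """personality-diagnosis-style prediction (illustrative sketch)."""
    probs = np.zeros(len(scale))
    for u in range(R.shape[0]):
        if u == active or R[u, item] == 0:
            continue
        # p(active user looks like user u), from their commonly rated items
        common = (R[active] > 0) & (R[u] > 0)
        if not common.any():
            continue
        diff = R[active, common] - R[u, common]
        p_same = np.exp(-(diff ** 2).sum() / (2 * sigma ** 2))
        probs[int(R[u, item]) - 1] += p_same     # u "votes" for its own rating
    if probs.sum() == 0:
        return None
    probs /= probs.sum()
    return scale[int(np.argmax(probs))], probs

R = np.array([[5, 4, 0, 1],
              [4, 5, 3, 1],
              [1, 2, 5, 4]], dtype=float)   # 0 = unrated
print(pd_predict(R, active=0, item=2))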
previous empirical studies have shown that the method is able to outperform several other approaches for collaborative filtering [***], including the pcc method, the vss method and the bayesian network approach - given a users preferences for some items, they compute the probability that a user belongs to the same personality diagnosis by assigning the missing rating as a uniform distribution over all possible ratings [***]. previous empirical studies have shown that the method is able to outperform several other approaches for collaborative filtering [***], including the pcc method, the vss method and the bayesian network approach. however, the method neither takes the whole aggregated information of the training database into account nor considers the diversity among users when rating the non-rated items influence:1 type:2 pair index:59 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:33 citee title:grouplens: an open architecture for collaborative filtering of netnews citee abstract:collaborative filters help people make choices based on the opinions of other people. grouplens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. news reader clients display predicted scores and make it easy for users to rate articles after they read them. rating servers, called better bit bureaus, gather and disseminate the ratings. the rating servers predict scores based on the heuristic that people who surrounding text:memory-based algorithms perform the computation on the entire database to identify the top k most similar users to the active user from the training database in terms of the rating patterns and then combines those ratings together. notable examples include the pearson-correlation based approach [***], the vector similarity based approach [4], and the extended generalized vector-space model [20]. these approaches focused on utilizing the existing rating of a training user as the features, however, the memory-based method suffers from two fundamental problems: data sparsity and inability to scale up - the basic idea is to compute the active users predicted vote of an item as a weighted average of votes by other similar users or k nearest neighbors (knn). two commonly used memory-based algorithms are the pearson correlation coefficient (pcc) algorithm [***] and the vector space similarity (vss) algorithm [4]. 
these two approaches differ in the computation of similarity influence:1 type:2 pair index:60 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:29 citee title:collaborative filtering and the generalized vector space model citee abstract:collaborative filtering is a technique for recommending documents to users based on how similar their tastes are to other users. if two users tend to agree on what they like, the system will recommend the same documents to them. the generalized vector space model of information retrieval represents a document by a vector of its similarities to all other documents. the process of collaborative filtering is nearly identical to the process of retrieval using gvsm in a matrix of user ratings. using this observation, a model for filtering collaboratively using document content is possible. surrounding text:memory-based algorithms perform the computation on the entire database to identify the top k most similar users to the active user from the training database in terms of the rating patterns and then combines those ratings together. notable examples include the pearson-correlation based approach [16], the vector similarity based approach [4], and the extended generalized vector-space model [***]. these approaches focused on utilizing the existing rating of a training user as the features, however, the memory-based method suffers from two fundamental problems: data sparsity and inability to scale up influence:1 type:2 pair index:61 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. 
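the two similarity measures contrasted above, pcc and vss, differ only in whether the ratings are mean-centered before taking a cosine; a small sketch, restricted to co-rated items (one common convention, assumed here):

import numpy as np

def pearson_sim(ru, rv):
    """pearson correlation coefficient over items rated by both users (0 = unrated)."""
    common = (ru > 0) & (rv > 0)
    if common.sum() < 2:
        return 0.0
    a, b = ru[common] - ru[common].mean(), rv[common] - rv[common].mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def vector_sim(ru, rv):
    """vector space (cosine) similarity on the raw co-rated ratings."""
    common = (ru > 0) & (rv > 0)
    if not common.any():
        return 0.0
    a, b = ru[common], rv[common]
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

u = np.array([5, 3, 0, 4], dtype=float)
v = np.array([4, 2, 1, 5], dtype=float)
print(pearson_sim(u, v), vector_sim(u, v))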
empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:26 citee title:clustering methods for collaborative filtering citee abstract:grouping people into clusters based on the items they have purchased allows accurate recommendations of new items for purchase: if you and i have liked many of the same movies, then i will probably enjoy other movies that you like. recommending items based on similarity of interest (a.k.a. collaborative filtering) is attractive for many domains: books, cds, movies, etc., but does not always work well. because data are always sparse { any given person has seen only a small fraction of all movies { much more accurate predictions can be made by grouping people into clusters with similar movies and grouping movies into clusters which tend to be liked by the same people. finding optimal clusters is tricky because the movie groups should be used to help determine the people groups and visa versa. we present a formal statistical model of collaborative filtering, and compare di erent algorithms for estimating the model parameters including variations of k-means clustering and gibbs sampling. this formal model is easily extended to handle clustering of objects with multiple attributes surrounding text:in order to predict the rating from an active user on a particular item, these approaches first categorize the active user into one or more of the predefined user classes and use the rating of the predicted classes on the targeted item as the prediction. algorithms within this category include bayesian network approach [4], clustering approach [13][***] and the aspect models [12]. the model-based approaches are often time-consuming to build and update, and cannot cover as diverse a user range as the memory-based approaches do - 1. 2 model-based approaches two popular model-based algorithms are the clustering for collaborative filtering [13][***] and the aspect models [12]. clustering techniques work by identifying groups of users who appear to have similar preferences influence:1 type:2 pair index:62 citer id:61 citer title:scalable collaborative filtering using cluster-based smoothing citer abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms citee id:135 citee title:similarity measure and instance selection for collaborative filtering citee abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. 
the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance surrounding text:principal component analysis (pca) [8] and information retrieval techniques such as latent semantic indexing (lsi) [7][18] are also proposed. zeng [***] proposed to compute the users' similarity by a matrix conversion method for similarity measure. the dimensionality-reduction approach addresses the sparsity problem by removing unrepresentative or insignificant users or items so as to condense the user-item matrix influence:1 type:2 pair index:63 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:35 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation metrics. the first characterizes accuracy over a set of individual predictions in terms of average absolute deviation. the second estimates the utility of a ranked list of suggested items. this metric uses an estimate of the probability that a user will see a recommendation in an ordered list. experiments were run for datasets associated with 3 application areas, 4 experimental protocols, and the 2 evaluation metrics for the various algorithms. results indicate that for a wide range of conditions, bayesian networks with decision trees at each node and correlation methods outperform bayesian-clustering and vector-similarity methods.
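the dimensionality-reduction route mentioned in the surrounding text above (pca / lsi style) can be sketched with a truncated svd of the user-item matrix; the rank k and the toy matrix are arbitrary choices for illustration, not the settings of the cited methods.

import numpy as np

R = np.array([[5, 4, 0, 1, 0],
              [4, 5, 1, 0, 1],
              [1, 0, 5, 4, 4],
              [0, 1, 4, 5, 4]], dtype=float)   # 0 = unrated

# truncated svd: keep the k strongest latent dimensions
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# R_hat is a dense low-rank approximation; unrated cells now carry estimates
print(np.round(R_hat, 2))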
between correlation and bayesian networks, the preferred method depends on the nature of the dataset, nature of the application (ranked versus one-by-one presentation), and the availability of votes with which to make predictions. other considerations include the size of database, speed of predictions, and learning time. surrounding text:besides avoiding the need for collecting extensive information about items or users, cf requires no domain knowledge and can be easily adopted in different recommender systems. collaborative filtering is usually adopted in two classes of application scenarios[***]. in the first class, a user is presented with one individual item at a time along with a predicted rating indicating the users potential interest in the item - a crucial component of the user-based model is the user-user similarity su,v that is used to select the set of neighbors. popular choices for su,v include the pearson correlation coefficient( pcc)[22, 11]and the vector similarity(vs)[***]. one difficulty in measuring the user-user similarity is that the raw ratings may contain biases caused by the different rating behaviors of different users - to correct such biases, different methods have been proposed to normalize or center the data prior to measuring user similarities. [22, ***] showed that by correcting for user-specific means the prediction quality could be improved. later, jin et al$ proposed a technique for normalizing the user ratings based on the halfway accumulative distribution[15] - 3. ratingoriented collaborative filtering in this section, we first describe several rating-based similarity measures that have been commonly used in neighborhoodbased cf approaches for finding similar users[22, ***] and similar items[24, 18]. we then discuss two models for rating prediction in neighborhood-based cf, namely user-based and item-based models influence:1 type:2 pair index:64 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:65 citee title:learning to rank using gradient descent citee abstract:we investigate using gradient descent methods for learning ranking functions; we propose a simple probabilistic cost function, and we introduce ranknet, an implementation of these ideas using a neural network to model the underlying ranking function. we present test results on toy data and on data from a commercial internet search engine. 
surrounding text:most of the proposed methods are dedicated to ranking items represented in some feature space as is the setting for content-based filtering. given a set of ordered pairs of instances as training data, the different methods either try to learn an item scoring function[***, 16] or learn a classifier for classifying item pairs into two types of relations(correctly ordered vs. incorrectly ordered)[5, 6] - incorrectly ordered)[5, 6]. different machine learning models including svm, boosting and neural network have been used for learning such ranking functions, which led to methods such as ranking svm[16], ranknet[***] and rankboost[6]. 3 - in content-based filtering, the preference function is often realized by a binary classifier that categorizes each pair of items into two categories(correctly ranked and incorrectly ranked) based on their content features. various machine learning approaches including ensembles[5, 6], support vector machines[16] and neural networks[***] have been developed for learning such a binary classifier to model the preference function. however, in the collaborative filtering setting, such approaches could not be applied due to the lack of features for describing the items influence:1 type:1 pair index:65 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:66 citee title:learning to order things citee abstract:there are many applications in which it is desirable to order rather than classify instances. here we consider the problem of learning how to order, given feedback in the form of preference judgments, i.e., statements to the effect that one instance should be ranked ahead of another. we outline a two-stage approach in which one first learns by conventional means a preference function, of the form preffi , which indicates whether it is advisable to rank  before  . new instances are then ordered so as to maximize agreements with the learned preference function. we show that the problem of finding the ordering that agrees best with a preference function is np-complete, even under very restrictive assumptions. nevertheless, we describe a simple greedy algorithm that is guaranteed to find a good approximation. we then discuss an on-line learning algorithm, based on the hedge algorithm, for finding a good linear combination of ranking experts. 
we use the ordering algorithm combined with the on-line learning algorithm to find a combination of search experts, each of which is a domain-specific query expansion strategy for a www search engine, and present experimental results that demonstrate the merits of our approach. surrounding text:given a set of ordered pairs of instances as training data, the different methods either try to learn an item scoring function[4, 16] or learn a classifier for classifying item pairs into two types of relations(correctly ordered vs. incorrectly ordered)[***, 6]. different machine learning models including svm, boosting and neural network have been used for learning such ranking functions, which led to methods such as ranking svm[16], ranknet[4] and rankboost[6] - in content-based filtering, the preference function is often realized by a binary classifier that categorizes each pair of items into two categories(correctly ranked and incorrectly ranked) based on their content features. various machine learning approaches including ensembles[***, 6], support vector machines[16] and neural networks[4] have been developed for learning such a binary classifier to model the preference function. however, in the collaborative filtering setting, such approaches could not be applied due to the lack of features for describing the items - however, cohen et al. [***] showed that finding the optimal ranking  is a np-complete problem based on reduction from the cyclic-ordering problem[7] and proposed an efficient greedy order algorithm for finding an approximately optimal ranking shown in algorithm 1 below: algorithm 1 greedy order input: an item set i; a preference function output: a ranking . 1: for each i 2 i do 2: (i) = pj2i (i, j) - it then deletes t from i and updates the potential values of the remaining items by removing the effects of t. the algorithm has a time complexity o(n2), where n denotes the number of items and it was shown in [***] that the ranking . produced by the greedy order algorithm has a value v () that is within a factor of 2 of the optimal, i influence:3 type:3 pair index:66 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:8 citee title:an efficient boosting algorithm for combining preferences citee abstract:we study the problem of learning to accurately rank a set of objects by combining a given collection of ranking or preference functions. 
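a runnable sketch of the greedy order procedure quoted above, assuming the usual potential pi(i) = sum_j psi(i, j) - sum_j psi(j, i): the item with the largest potential is placed next, removed from the item set, and the potentials of the remaining items are updated by removing its effect. the preference function psi below is a made-up toy example.

def greedy_order(items, psi):
    """greedy order: returns items from most to least preferred.

    psi(i, j) > 0 means i should be ranked ahead of j; psi is assumed
    anti-symmetric (psi(i, j) = -psi(j, i)) with psi(i, i) = 0.
    """
    remaining = set(items)
    # potential pi(i) = sum_j psi(i, j) - sum_j psi(j, i)
    pi = {i: sum(psi(i, j) - psi(j, i) for j in remaining) for i in remaining}
    ranking = []
    while remaining:
        t = max(remaining, key=lambda i: pi[i])   # pick the best remaining item
        ranking.append(t)
        remaining.remove(t)
        for i in remaining:                       # remove the effect of t
            pi[i] -= psi(i, t) - psi(t, i)
    return ranking

# toy anti-symmetric preference function over three items
prefs = {("a", "b"): 1.0, ("b", "c"): 0.5, ("a", "c"): 0.8}
def psi(i, j):
    if i == j:
        return 0.0
    return prefs.get((i, j), -prefs.get((j, i), 0.0))

print(greedy_order(["a", "b", "c"], psi))   # ['a', 'b', 'c']

with an anti-symmetric psi the two sums in the potential are redundant, but keeping both terms matches the general form of the procedure; the nested loop also makes the o(n^2) cost mentioned in the text visible.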
this problem of combining preferences arises in several applications, such as that of combining the results of different search engines, or the collaborativefiltering problem of ranking movies for a user based on the movie rankings provided by other users. in this work, we begin by presenting a formal framework for this general problem. we then describe and analyze an efficient algorithm called rankboost for combining preferences based on the boosting approach to machine learning. we give theoretical results describing the algorithms behavior both on the training data, and on new test data not seen during training. we also describe an efficient implementation of the algorithm for a particular restricted but common case. we next discuss two experiments we carried out to assess the performance of rankboost. in the first experiment, we used the algorithm to combine different web search strategies, each of which is a query expansion for a given domain. the second experiment is a collaborative-filtering task for making movie recommendations. surrounding text:given a set of ordered pairs of instances as training data, the different methods either try to learn an item scoring function[4, 16] or learn a classifier for classifying item pairs into two types of relations(correctly ordered vs. incorrectly ordered)[5, ***]. different machine learning models including svm, boosting and neural network have been used for learning such ranking functions, which led to methods such as ranking svm[16], ranknet[4] and rankboost[***] - incorrectly ordered)[5, ***]. different machine learning models including svm, boosting and neural network have been used for learning such ranking functions, which led to methods such as ranking svm[16], ranknet[4] and rankboost[***]. 3 - the magnitude of this preference function (i, j) indicates the strength of preference and a value of zero means that there is no preference between the two items. following [***], we assume that (i, i) = 0 for all i 2 i and that is anti-symmetric, i$e$ (i, j) = . (j, i) for all i, j 2 i - in content-based filtering, the preference function is often realized by a binary classifier that categorizes each pair of items into two categories(correctly ranked and incorrectly ranked) based on their content features. various machine learning approaches including ensembles[5, ***], support vector machines[16] and neural networks[4] have been developed for learning such a binary classifier to model the preference function. however, in the collaborative filtering setting, such approaches could not be applied due to the lack of features for describing the items influence:1 type:1 pair index:67 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. 
we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:45 citee title:computers and intractability: a guide to the theory of np-completeness citee abstract:the book opens with a description of why np-completeness is such a useful concept. it turns out that it's very difficult to prove that a given problem is guaranteed to take a computer a long time to solve. there is, however, a class of problems (the ones that are np-complete) that are provably equivalently difficult to solve. since nobody has yet found a good algorithm (defined here as one that runs in polynomial time), it's probably not worth working on finding a good algorithm to solve the problem exactly because so many smart people have already tried and failed. surrounding text:however, cohen et al. [5] showed that finding the optimal ranking  is a np-complete problem based on reduction from the cyclic-ordering problem[***] and proposed an efficient greedy order algorithm for finding an approximately optimal ranking shown in algorithm 1 below: algorithm 1 greedy order input: an item set i; a preference function output: a ranking . 1: for each i 2 i do 2: (i) = pj2i (i, j) influence:3 type:3 pair index:68 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:67 citee title:eigentaste: a constant time collaborative filtering algorithm citee abstract:eigentaste is a collaborative filtering algorithm that uses universal queries to elicit real-valued user ratings on a common set of items and applies principal component analysis (pca) to the resulting dense subset of the ratings matrix. pca facilitates dimensionality reduction for offline clustering of users and rapid computation of recommendations. for a database of n users, standard nearest-neighbor techniques require o(n) processing time to compute recommendations, whereas eigentaste requires o(1) (constant) time. we compare eigentaste to alternative algorithms using data from jester, an online joke recommending system. 
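the ranking-based user similarity described in the citer abstract above, a correlation between two users' orderings of the items rather than between the raw scores, is in spirit a kendall-tau statistic over co-rated item pairs; a small sketch under that reading:

from itertools import combinations

def kendall_rank_sim(ratings_u, ratings_v):
    """rank correlation over items rated by both users (dicts: item -> rating)."""
    common = sorted(set(ratings_u) & set(ratings_v))
    conc = disc = 0
    for i, j in combinations(common, 2):
        du = ratings_u[i] - ratings_u[j]
        dv = ratings_v[i] - ratings_v[j]
        if du * dv > 0:
            conc += 1          # both users order the pair the same way
        elif du * dv < 0:
            disc += 1          # the users disagree on the pair
    pairs = conc + disc
    return (conc - disc) / pairs if pairs else 0.0

u = {"m1": 5, "m2": 3, "m3": 4}
v = {"m1": 4, "m2": 2, "m3": 5, "m4": 1}
print(kendall_rank_sim(u, v))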
jester has collected approximately 2,500,000 ratings from 57,000 users. we use the normalized mean absolute error (nmae) measure to compare performance of different algorithms. in the appendix we use uniform and normal distribution models to derive analytic estimates of nmae when predictions are random. on the jester dataset, eigentaste computes recommendations two orders of magnitude faster with no loss of accuracy. jester is online at: http://eigentaste.berkeley.edu surrounding text:another difficulty in user-based models arises from the fact that the known user-item ratings data is typically highly sparse, which makes it very hard to find highly similar neighbors for making accurate predictions. to alleviate such sparsity problem, different techniques have been proposed to fill in some of the unknown ratings in the matrix such as dimensionality reduction[***] and data-smoothing methods[25, 19]. an alternative form of the neighborhood-based approach is the item-based model[24, 18] influence:1 type:2 pair index:69 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:68 citee title:topic-sensitive pagerank citee abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative \importance" ofweb pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. 
for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared surrounding text:the reasoning is that a web surfer can sometimes teleport to other pages according to the probability distribution defined by v independent of the current page and the parameter controls how often the surfer may teleport to another page rather than following the hyperlinks. a few follow-up works to pagerank proposed different methods to bias the personalization vector v to take into account different types of information besides the link structure such as contents[***, 23] and user preferences[14]. in our random walk model for item ranking, we follow a similar idea to define a personalization vector v_u = [p_u(1), influence:1 type:1 pair index:70 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:9 citee title:an empirical analysis of design choices in neighborhood-based collaborative filtering algorithms citee abstract:collaborative filtering systems predict a user's interest in new items based on the recommendations of other people with similar interests. instead of performing content indexing or content analysis, collaborative filtering systems rely entirely on interest ratings from members of a participating community. since predictions are based on human ratings, collaborative filtering systems have the potential to provide filtering based on complex attributes, such as quality, taste, or aesthetics. many implementations of collaborative filtering apply some variation of the neighborhood-based prediction algorithm. many variations of similarity metrics, weighting approaches, combination measures, and rating normalization have appeared in each implementation. for these parameters and others, there is no consensus as to which choice of technique is most appropriate for what situations, nor how significant an effect on accuracy each parameter has. consequently, every person implementing a collaborative filtering system must make hard design choices with little guidance. this article provides a set of recommendations to guide design of neighborhood-based prediction systems, based on the results of an empirical study.
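a minimal power-iteration sketch of the teleportation idea discussed above: with probability alpha the walker follows the transition matrix, otherwise it jumps according to a personalization vector v. the 3-node transition matrix, the biased v, and the name alpha for the teleport-control parameter are all assumptions of this sketch.

import numpy as np

def personalized_pagerank(P, v, alpha=0.85, iters=100):
    """stationary distribution of a random walk with teleportation to v.

    P is column-stochastic (P[i, j] = prob. of moving j -> i),
    v is the personalization (teleport) distribution.
    """
    pi = np.full(len(v), 1.0 / len(v))
    for _ in range(iters):
        pi = alpha * (P @ pi) + (1 - alpha) * v
    return pi

# toy 3-node transition matrix (columns sum to 1) and a biased teleport vector
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
v = np.array([0.6, 0.3, 0.1])
print(personalized_pagerank(P, v).round(3))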
we apply an analysis framework that divides the neighborhood-based prediction approach into three components and then examines variants of the key parameters in each component. the three components identified are similarity computation, neighbor selection, and rating combination. surrounding text:a crucial component of the user-based model is the user-user similarity s_{u,v} that is used to select the set of neighbors. popular choices for s_{u,v} include the pearson correlation coefficient (pcc) [22, ***] and the vector similarity (vs) [2]. one difficulty in measuring the user-user similarity is that the raw ratings may contain biases caused by the different rating behaviors of different users influence:1 type:2 pair index:71 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:56 citee title:latent semantic models for collaborative filtering citee abstract:collaborative filtering aims at learning predictive models of user preferences, interests or behavior from community data, that is, a database of available user preferences. in this article, we describe a new family of model-based algorithms designed for this task. these algorithms rely on a statistical modelling technique that introduces latent class variables in a mixture model setting to discover user communities and prototypical interest profiles. we investigate several variations to deal with discrete and continuous response variables as well as with different objective functions. the main advantages of this technique over standard memory-based methods are higher accuracy, constant time prediction, and an explicit and compact model representation. the latter can also be used to mine for user communities. the experimental evaluation shows that substantial improvements in accuracy over existing methods and published results can be obtained. surrounding text:2 model-based approaches in contrast to the neighborhood-based approach, the model-based approach to cf uses the observed user-item ratings to train a compact model that explains the given data so that ratings could be predicted via the model instead of directly manipulating the original rating database as the neighborhood-based approach does. algorithms in this category include clustering methods[25], aspect models[***] and bayesian networks[21].
2 influence:1 type:2 pair index:72 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:46 citee title:cumulated gain-based evaluation of ir techniques citee abstract:modern large retrieval environments tend to overwhelm their users by their large output. since all documents are not of equal relevance to their users, highly relevant documents should be identified and ranked first for presentation. in order to develop ir techniques in this direction, it is necessary to develop evaluation approaches and methods that credit ir methods for their ability to retrieve highly relevant documents. this can be done by extending traditional evaluation methods, that is, recall and precision based on binary relevance judgments, to graded relevance judgments. alternatively, novel measures based on graded relevance judgments may be developed. this article proposes several novel measures that compute the cumulative gain the user obtains by examining the retrieval result up to a given ranked position. the first one accumulates the relevance scores of retrieved documents along the ranked result list. the second one is similar but applies a discount factor to the relevance scores in order to devaluate late-retrieved documents. the third one computes the relative-to-the-ideal performance of ir techniques, based on the cumulative gain they are able to yield. these novel measures are defined and discussed and their use is demonstrated in a case study using trec data: sample system run results for 20 queries in trec-7. as a relevance base we used novel graded relevance judgments on a four-point scale. the test results indicate that the proposed measures credit ir methods for their ability to retrieve highly relevant documents and allow testing of statistical significance of effectiveness differences. the graphs based on the measures also provide insight into the performance ir techniques and allow interpretation, for example, from the user point of view. surrounding text:commonly used measures for accuracy include the mean absolute error (mae) and the root mean square error (rmse), both of which depend on difference between true rating and predicted rating. 
however, since the emphasis of our work is on improving item rankings instead of rating prediction, we employ the normalized discounted cumulative gain(ndcg)[***] metric, which is an increasingly popular metric for evaluating ranked results in information retrieval where the documents are assigned graded rather than binary relevance judgements. for collaborative filtering applications, the ratings on items assigned by the users can naturally serve as the graded relevance judgements influence:3 type:3 pair index:73 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:69 citee title:scaling personalized web search citee abstract:recent web search techniques augment traditional text matching with a global notion of ``importance'' based on the linkage structure of the web, such as in google's pagerank algorithm. for more refined searches, this global notion of importance can be specialized to create personalized views of importance--for example, importance scores can be biased according to a user-specified set of initially-interesting pages. computing and storing all possible personalized views in advance is impractical, as is computing personalized views at query time, since the computation of each view requires an iterative computation over the web graph. we present new graph-theoretical results, and a new technique based on these results, that encode personalized views as partial vectors. partial vectors are shared across multiple personalized views, and their computation and storage costs scale well with the number of views. our approach enables incremental computation, so that the construction of personalized views from partial vectors is practical at query time. we present efficient dynamic programming algorithms for computing partial vectors, an algorithm for constructing personalized views from partial vectors, and experimental results demonstrating the effectiveness and scalability of our techniques. surrounding text:the reasoning is that a web surfer can sometimes teleportto other pages according to the probability distribution defined by v independent of the current page and the parameter  controls how often the surfer may teleport to another page rather than following the hyperlinks. 
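the ndcg measure discussed earlier in this record can be computed directly from graded ratings: a discounted cumulative gain over the ranked list, normalized by the dcg of the ideally reordered list. the log2 discount and the use of raw ratings as gains follow one common convention and are assumptions of this sketch.

import math

def dcg(gains):
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg(ranked_ratings, k=None):
    """ndcg of a ranked list of graded relevance scores (here: item ratings)."""
    gains = ranked_ratings[:k] if k else ranked_ratings
    ideal = sorted(ranked_ratings, reverse=True)[:len(gains)]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# ratings of the items in the order a recommender ranked them
print(round(ndcg([3, 5, 4, 1]), 3))   # below 1.0: the 5-rated item was not ranked first
print(round(ndcg([5, 4, 3, 1]), 3))   # 1.0: ideal ordering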
a few follow up works to pagerank proposed different methods to bias the personalization vector v to take into account different types of information besides the link structure such as contents[10, 23] and user preferences[***]. in our random walk model for item ranking, we follow the similar idea to define a personalization vector vu = [pu(1), influence:2 type:3,1 pair index:74 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:34 citee title:collaborative filtering with decoupled models for preferences and ratings citee abstract:in this paper, we describe a new model for collaborative filtering. the motivation of this work comes from the fact that two users with very similar preferences on items may have very different rating schemes. for example, one user may tend to assign a higher rating to all items than another user. unlike previous models of collaborative filtering, which determine the similarity between two users only based on their rating performance, our model treats the users preferences on items separately from the users rating scheme. more specifically, for each user, we build two separate models: a preference model capturing which items are favored by the user and a rating model capturing how the user would rate an item given the preference information. the similarity of two users is computed based on the underlying preference model, instead of the surface ratings. we compare the new model with several representative previous approaches on two data sets. experiment results show that the new model outperforms all the previous approaches that are tested consistently on both data sets surrounding text:[22, 2] showed that by correcting for user-specific means the prediction quality could be improved. later, jin et al$ proposed a technique for normalizing the user ratings based on the halfway accumulative distribution[***]. another difficulty in user-based models arises from the fact that the known user-item ratings data is typically highly sparse, which makes it very hard to find highly similar neighbors for making accurate predictions influence:1 type:2 pair index:75 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. 
given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:70 citee title:optimizing search engines using clickthrough data citee abstract:this paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. while previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. this makes them difficult and expensive to apply. the goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. such clickthrough data is available in abundance and can be recorded at very low cost. taking a support vector machine (svm) approach, this paper presents a method for learning retrieval functions. from a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. furthermore, it is shown to be feasible even for large sets of queries and features. the theoretical results are verified in a controlled experiment. it shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming google in terms of retrieval quality after only a couple of hundred training examples surrounding text:most of the proposed methods are dedicated to ranking items represented in some feature space as is the setting for content-based filtering. given a set of ordered pairs of instances as training data, the different methods either try to learn an item scoring function[4, ***] or learn a classifier for classifying item pairs into two types of relations(correctly ordered vs. incorrectly ordered)[5, 6] - incorrectly ordered)[5, 6]. different machine learning models including svm, boosting and neural network have been used for learning such ranking functions, which led to methods such as ranking svm[***], ranknet[4] and rankboost[6]. 3 - in content-based filtering, the preference function is often realized by a binary classifier that categorizes each pair of items into two categories(correctly ranked and incorrectly ranked) based on their content features. various machine learning approaches including ensembles[5, 6], support vector machines[***] and neural networks[4] have been developed for learning such a binary classifier to model the preference function. 
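the excerpt above describes learning a ranking function from ordered item pairs by treating each pair as a binary classification instance. a minimal sketch of that pairwise idea, using a linear scorer trained with a logistic loss on feature differences, is given below; it is only illustrative and is not the specific ranking svm, ranknet or rankboost formulation cited, and the toy features are invented.

```python
import numpy as np

def train_pairwise_ranker(pairs, dim, lr=0.1, epochs=200):
    """learn w so that score(x_pos) > score(x_neg) for each ordered pair.

    pairs: list of (x_preferred, x_less_preferred) feature vectors.
    equivalent to logistic regression on the difference x_pos - x_neg with label 1.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_pos, x_neg in pairs:
            d = np.asarray(x_pos) - np.asarray(x_neg)
            p = 1.0 / (1.0 + np.exp(-w @ d))   # probability the pair is correctly ordered
            w += lr * (1.0 - p) * d            # gradient ascent on the log-likelihood
    return w

# toy example: the first feature is what actually drives preference
pairs = [([3.0, 0.1], [1.0, 0.9]), ([2.0, 0.5], [0.5, 0.4])]
w = train_pairwise_ranker(pairs, dim=2)
print(w @ np.array([3.0, 0.1]) > w @ np.array([1.0, 0.9]))  # True
```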
however, in the collaborative filtering setting, such approaches could not be applied due to the lack of features for describing the items influence:3 type:1 pair index:76 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:28 citee title:grouplens: applying collaborative filtering to usenet news citee abstract:the grouplens project designed, implemented, and evaluated a collaborative filtering system for usenet newsa high-volume, high-turnover discussion list service on the internet. usenet newsgroupsthe individual discussion listsmay carry hundreds of messages each day. while in theory the newsgroup organization allows readers to select the content that most interests them, in practice surrounding text:in the first class, a user is presented with one individual item at a time along with a predicted rating indicating the users potential interest in the item. an example of this category is grouplens[***], a collaborative filtering system for usenet news. the second class of applications produce an ordered list of top-n recommended items where the highest ranked items are predicted to be most preferred by the user influence:1 type:2 pair index:77 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. 
experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:6 citee title:amazon.com recommendations: item-to-item collaborative filtering citee abstract:recommendation algorithms are best known for their use on e-commerce web sites,1 where they use input about a customers interests to generate a list of recommended items. many applications use only the items that customers purchase and explicitly rate to represent their interests, but they can also use other attributes, including items viewed, demographic data, subject interests, and favorite artists. at amazon.com, we use recommendation algorithms to personalize the online store for each customer. the store radically changes based on customer interests, showing programming titles to a software engineer and baby toys to a new mother. the click-through and conversion rates two important measures of web-based and email advertising effectiveness vastly exceed those of untargeted content such as banner advertisements and top-seller lists. surrounding text:to alleviate such sparsity problem, different techniques have been proposed to fill in some of the unknown ratings in the matrix such as dimensionality reduction[8] and data-smoothing methods[25, 19]. an alternative form of the neighborhood-based approach is the item-based model[24, ***]. here the item-item similarity is used to select a set of neighboring items that have been rated by the target user and the ratings on the unrated items are predicted based on his ratings on the neighboring items - 3. ratingoriented collaborative filtering in this section, we first describe several rating-based similarity measures that have been commonly used in neighborhoodbased cf approaches for finding similar users[22, 2] and similar items[24, ***]. we then discuss two models for rating prediction in neighborhood-based cf, namely user-based and item-based models influence:1 type:2 pair index:78 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. 
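the excerpt above sketches the item-based neighborhood model: the rating of an unrated item is predicted from the target user's ratings on the most similar items. a minimal sketch follows; the similarity-weighted average and the dictionary-based data layout are assumptions made for illustration, not the cited papers' exact implementation.

```python
def predict_item_based(user_ratings, item_sims, target_item, k=50):
    """predict a user's rating on target_item from the k most similar items
    the user has already rated (weighted average by item-item similarity).

    user_ratings: {item_id: rating} for the target user.
    item_sims:    {(item_a, item_b): similarity}, assumed symmetric.
    """
    neighbours = sorted(
        ((item_sims.get((target_item, j), 0.0), r) for j, r in user_ratings.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    num = sum(sim * r for sim, r in neighbours if sim > 0)
    den = sum(abs(sim) for sim, r in neighbours if sim > 0)
    return num / den if den > 0 else None

# toy example: the prediction is pulled toward the rating of the most similar item
sims = {("i3", "i1"): 0.9, ("i3", "i2"): 0.2}
print(predict_item_based({"i1": 5, "i2": 2}, sims, "i3", k=2))
```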
experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:53 citee title:effective missing data prediction for collaborative filtering citee abstract:memory-based collaborative filtering algorithms have been widely adopted in many popular recommender systems, although these approaches all suffer from data sparsity and poor prediction quality problems. usually, the user-item matrix is quite sparse, which directly leads to inaccurate recommendations. this paper focuses the memory-based collaborative filtering problems on two crucial factors: (1) similarity computation between users or items and (2) missing data prediction algorithms. first, we use the enhanced pearson correlation coefficient (pcc) algorithm by adding one parameter which overcomes the potential decrease of accuracy when computing the similarity of users or items. second, we propose an effective missing data prediction algorithm, in which information of both users and items is taken into account. in this algorithm, we set the similarity threshold for users and items respectively, and the prediction algorithm will determine whether predicting the missing data or not. we also address how to predict the missing data by employing a combination of user and item information. finally, empirical studies on dataset movielens have shown that our newly proposed method outperforms other stateofthe-art collaborative filtering algorithms and it is more robust against data sparsity surrounding text:another difficulty in user-based models arises from the fact that the known user-item ratings data is typically highly sparse, which makes it very hard to find highly similar neighbors for making accurate predictions. to alleviate such sparsity problem, different techniques have been proposed to fill in some of the unknown ratings in the matrix such as dimensionality reduction[8] and data-smoothing methods[25, ***]. an alternative form of the neighborhood-based approach is the item-based model[24, 18] influence:1 type:2 pair index:79 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. 
experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:30 citee title:collaborative filtering by personality diagnosis: a hybrid memory and model-based approach citee abstract:the growth of internet commerce has stimulated the use of collaborative filtering (cf) algorithms as recommender systems. such systems leverage knowledge about the known preferences of multiple users to recommend items of interest to other users. cf methods have been harnessed to make recommendations about such items as web pages, movies, books, and toys. researchers have proposed and evaluated many approaches for generating recommendations. we describe and evaluate a new method called personality diagnosis (pd). given a users preferences for some items, we compute the probability that he or she is of the same personality type as other users, and, in turn, the probability that he or she will like new items. pd retains some of the advantages of traditional similarity-weighting techniques in that all data is brought to bear on each prediction and new data can be added easily and incrementally. additionally, pd has a meaningful probabilistic interpretation, which may be leveraged to justify, explain, and augment results. we report empirical results on the eachmovie database of movie ratings, and on user profile data collected from the citeseer digital library of computer science research papers. the probabilistic framework naturally supports a variety of descriptive measurementsin particular, we consider the applicability of a value of information (voi) computation. surrounding text:2 model-based approaches in contrast to the neighborhood-based approach, the modelbased approach to cf use the observed user-item ratings to train a compact model that explains the given data so that ratings could be predicted via the model instead of directly manipulating the original rating database as the neighborhood-based approach does. algorithms in this category include clustering methods[25], aspect models[12] and bayesian networks[***]. 2 influence:1 type:2 pair index:80 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. 
experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:33 citee title:grouplens: an open architecture for collaborative filtering of netnews citee abstract:collaborative filters help people make choices based on the opinions of other people. grouplens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. news reader clients display predicted scores and make it easy for users to rate articles after they read them. rating servers, called better bit bureaus, gather and disseminate the ratings. the rating servers predict scores based on the heuristic that people who surrounding text:a crucial component of the user-based model is the user-user similarity su,v that is used to select the set of neighbors. popular choices for su,v include the pearson correlation coefficient( pcc)[***, 11]and the vector similarity(vs)[2]. one difficulty in measuring the user-user similarity is that the raw ratings may contain biases caused by the different rating behaviors of different users - to correct such biases, different methods have been proposed to normalize or center the data prior to measuring user similarities. [***, 2] showed that by correcting for user-specific means the prediction quality could be improved. later, jin et al$ proposed a technique for normalizing the user ratings based on the halfway accumulative distribution[15] - 3. ratingoriented collaborative filtering in this section, we first describe several rating-based similarity measures that have been commonly used in neighborhoodbased cf approaches for finding similar users[***, 2] and similar items[24, 18]. we then discuss two models for rating prediction in neighborhood-based cf, namely user-based and item-based models influence:1 type:2 pair index:81 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:71 citee title:the intelligent surfer: probabilistic combination of link and content information in pagerank citee abstract:the pagerank algorithm, used in the google search engine, greatly improves the results of web search by taking into account the link structure of the web. 
pagerank assigns to a page a score proportional to the number of times a random surfer would visit that page, if it surfed indefinitely from page to page, following all outlinks from a page with equal probability. we propose to improve pagerank by using a more intelligent surfer, one that is guided by a probabilistic model of the relevance of a page to a query. efficient execution of our algorithm at query time is made possible by precomputing at crawl time (and thus once for all queries) the necessary terms. experiments on two large subsets of the web indicate that our algorithm significantly outperforms pagerank in the (humanrated) quality of the pages returned, while remaining efficient enough to be used in todays large search engines. surrounding text:the reasoning is that a web surfer can sometimes teleportto other pages according to the probability distribution defined by v independent of the current page and the parameter  controls how often the surfer may teleport to another page rather than following the hyperlinks. a few follow up works to pagerank proposed different methods to bias the personalization vector v to take into account different types of information besides the link structure such as contents[10, ***] and user preferences[14]. in our random walk model for item ranking, we follow the similar idea to define a personalization vector vu = [pu(1), influence:2 type:1,3 pair index:82 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score. given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:58 citee title:item-based collaborative filtering recommendation algorithms citee abstract:recommender systems apply knowledge discovery techniques to the problem of making personalized recommendations for information, products or services during a live interaction. these systems, especially the k-nearest neighbor collaborative filtering based ones, are achieving widespread success on the web. the tremendous growth in the amount of available information and the number of visitors to web sites in recent years poses some key challenges for recommender systems. these are: producing high quality recommendations, performing many recommendations per second for millions of users and items and achieving high coverage in the face of data sparsity. in traditional collaborative filtering systems the amount of work increases with the number of participants in the system. 
new recommender system technologies are needed that can quickly produce high quality recommendations, even for very large-scale problems. to address these issues we have explored item-based collaborative filtering techniques. item-based techniques first analyze the user-item matrix to identify relationships between different items, and then use these relationships to indirectly compute recommendations for users. in this paper we analyze different item-based recommendation generation algorithms. we look into different techniques for computing item-item similarities (e.g., item-item correlation vs. cosine similarities between item vectors) and different techniques for obtaining recommendations from them (e.g., weighted sum vs. regression model). finally, we experimentally evaluate our results and compare them to the basic k-nearest neighbor approach. our experiments suggest that item-based algorithms provide dramatically better performance than user-based algorithms, while at the same time providing better quality than the best available user-based algorithms surrounding text:to alleviate such a sparsity problem, different techniques have been proposed to fill in some of the unknown ratings in the matrix, such as dimensionality reduction[8] and data-smoothing methods[25, 19]. an alternative form of the neighborhood-based approach is the item-based model[***, 18]. here the item-item similarity is used to select a set of neighboring items that have been rated by the target user, and the ratings on the unrated items are predicted based on his ratings on the neighboring items - sarwar et al. [***] recommended using the adjusted cosine similarity to compute the item-item similarity and found that the item-based model could obtain higher accuracy than the user-based model, while allowing more efficient computations. - 3. rating-oriented collaborative filtering in this section, we first describe several rating-based similarity measures that have been commonly used in neighborhood-based cf approaches for finding similar users[22, 2] and similar items[***, 18]. we then discuss two models for rating prediction in neighborhood-based cf, namely user-based and item-based models - $s_{u,v} = \frac{\sum_{i \in I_u \cap I_v}(r_{u,i}-\bar{r}_u)(r_{v,i}-\bar{r}_v)}{\left(\sum_{i \in I_u \cap I_v}(r_{u,i}-\bar{r}_u)^2 \sum_{i \in I_u \cap I_v}(r_{v,i}-\bar{r}_v)^2\right)^{1/2}}$ (1) 3.2 vector similarity another way of measuring user-user similarity is to view each user as a vector in a high dimensional vector space based on his ratings, so that the cosine of the angle between the two corresponding vectors can be used to measure their similarities: $s_{u,v} = \frac{\sum_{i \in I_u \cap I_v} r_{u,i}\, r_{v,i}}{\left(\sum_{i \in I_u \cap I_v} r_{u,i}^2 \sum_{i \in I_u \cap I_v} r_{v,i}^2\right)^{1/2}}$ (2) for measuring item-item similarity in item-based models, the adjusted cosine similarity has been shown to be most effective [***]: $s_{i,j} = \frac{\sum_{u \in U_i \cap U_j}(r_{u,i}-\bar{r}_u)(r_{u,j}-\bar{r}_u)}{\left(\sum_{u \in U_i \cap U_j}(r_{u,i}-\bar{r}_u)^2 \sum_{u \in U_i \cap U_j}(r_{u,j}-\bar{r}_u)^2\right)^{1/2}}$ (3) - [figure residue: results on eachmovie (above) and netflix (below); correlation coefficient (gokrcc)]. for ipcc and ivs, we set the number of neighboring items to be 50 as suggested by [***]. for all the other algorithms, we set the size of the neighborhood nu to be 100 influence:1 type:2 pair index:83 citer id:64 citer title:eigenrank a ranking-oriented approach to collaborative filtering citer abstract:a recommender system must be able to suggest items that are likely to be preferred by the user. in most systems, the degree of preference is represented by a rating score.
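for concreteness, the three rating-based similarity measures quoted above (pearson correlation over co-rated items, the cosine/vector similarity, and the adjusted cosine for item-item similarity) could be computed along the following lines. the pearson means here are taken over the co-rated items and the adjusted-cosine offsets use each user's overall mean; other formulations differ, so treat the indexing conventions as assumptions.

```python
import math

def pearson_sim(ru, rv):
    """pearson correlation over the items both users rated (eq. (1) style)."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    mu_u = sum(ru[i] for i in common) / len(common)
    mu_v = sum(rv[i] for i in common) / len(common)
    num = sum((ru[i] - mu_u) * (rv[i] - mu_v) for i in common)
    den = math.sqrt(sum((ru[i] - mu_u) ** 2 for i in common)
                    * sum((rv[i] - mu_v) ** 2 for i in common))
    return num / den if den else 0.0

def cosine_sim(ru, rv):
    """vector similarity (eq. (2) style): cosine over co-rated items."""
    common = set(ru) & set(rv)
    num = sum(ru[i] * rv[i] for i in common)
    den = math.sqrt(sum(ru[i] ** 2 for i in common) * sum(rv[i] ** 2 for i in common))
    return num / den if den else 0.0

def adjusted_cosine_sim(ratings_by_user, i, j):
    """adjusted cosine between items i and j (eq. (3) style): each rating is
    offset by that user's mean rating before taking the cosine."""
    num = den_i = den_j = 0.0
    for user, r in ratings_by_user.items():
        if i in r and j in r:
            mean = sum(r.values()) / len(r)   # the user's overall mean rating
            num += (r[i] - mean) * (r[j] - mean)
            den_i += (r[i] - mean) ** 2
            den_j += (r[j] - mean) ** 2
    den = math.sqrt(den_i * den_j)
    return num / den if den else 0.0
```

the inputs are dictionaries mapping item ids to ratings (and, for the adjusted cosine, a dictionary of such dictionaries keyed by user), which is just one convenient layout for sparse rating data.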
given a database of users past ratings on a set of items, traditional collaborative filtering algorithms are based on predicting the potential ratings that a user would assign to the unrated items so that they can be ranked by the predicted ratings to produce a list of recommended items. in this paper, we propose a collaborative filtering approach that addresses the item ranking problem directly by modeling user preferences derived from the ratings. we measure the similarity between users based on the correlation between their rankings of the items rather than the rating values and propose new collaborative filtering algorithms for ranking items based on the preferences of similar users. experimental results on real world movie rating data sets show that the proposed approach outperforms traditional collaborative filtering algorithms significantly on the ndcg measure for evaluating ranked results citee id:61 citee title:scalable collaborative filtering using cluster-based smoothing citee abstract:memory-based approaches for collaborative filtering identify the similarity between two users by comparing their ratings on a set of items. in the past, the memory-based approaches have been shown to suffer from two fundamental problems: data sparsity and difficulty in scalability. alternatively, the model-based approaches have been proposed to alleviate these problems, but these approaches tends to limit the range of users. in this paper, we present a novel approach that combines the advantages of these two kinds of approaches by introducing a smoothing-based method. in our approach, clusters generated from the training data provide the basis for data smoothing and neighborhood selection. as a result, we provide higher accuracy as well as increased efficiency in recommendations. empirical studies on two datasets (eachmovie and movielens) show that our new proposed approach consistently outperforms other state-of-the-art collaborative filtering algorithms surrounding text:another difficulty in user-based models arises from the fact that the known user-item ratings data is typically highly sparse, which makes it very hard to find highly similar neighbors for making accurate predictions. to alleviate such sparsity problem, different techniques have been proposed to fill in some of the unknown ratings in the matrix such as dimensionality reduction[8] and data-smoothing methods[***, 19]. an alternative form of the neighborhood-based approach is the item-based model[24, 18] - 2 model-based approaches in contrast to the neighborhood-based approach, the modelbased approach to cf use the observed user-item ratings to train a compact model that explains the given data so that ratings could be predicted via the model instead of directly manipulating the original rating database as the neighborhood-based approach does. algorithms in this category include clustering methods[***], aspect models[12] and bayesian networks[21]. 2 influence:1 type:2 pair index:84 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative \importance" ofweb pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. 
by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:86 citee title:improved algorithms for topic distillation in a hyperlinked environment citee abstract:this paper addresses the problem of topic distillation on theworld wideweb, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis surrounding text:algorithm proposed in [14] relies on query-time processing to deduce the hubs and authorities that exist in a subgraph of the web consisting of both the results to a query and the local neighborhood of these results. [***] augments the hits algorithm with content analysis to improve precision for the task of retrieving documents related to a query topic (as opposed to retrieving documents that exactly satisfy the user's information need). [8] makes use of hits for automatically compiling resource lists for general topics influence:2 type:2 pair index:85 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative \importance" ofweb pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:108 citee title:the anatomy of a large-scale hypertextual web search engine citee abstract:in this paper, we present google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. 
the prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. to engineer a search engine is a challenging task. search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. they answer tens of millions of queries every day. despite the importance of large-scale search engines on the web, very little academic research has been done on them. furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. this paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. this paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want surrounding text:[8] makes use of hits for automatically compiling resource lists for general topics. the pagerank algorithm discussed in [***, 16] precomputes a rank vector that provides a-priori "importance" estimates for all of the pages on the web. this vector is computed once, offline, and is independent of the search query - 2. review of pagerank a review of the pagerank algorithm ([16, ***, 11]) follows. the basic idea of pagerank is that if page u has a link to page v, then the author of u is implicitly conferring some importance to page v - in other words, for some user k, we can use a prior distribution p_k(c_j) that reflects the interests of user k. this method provides an alternative framework for user-based personalization, rather than directly varying the damping vector p as had been suggested in [***, 6]. using a text index, we retrieve urls for all documents containing the original query terms q - $s_{qd} = \sum_j p(c_j \mid q') \cdot rank_{jd}$ (8) the results are ranked according to this composite score $s_{qd}$. the above query-sensitive pagerank computation has the following probabilistic interpretation, in terms of the "random surfer" model [***]. let $w_j$ be the coefficient used to weight the jth rank vector, with $\sum_j w_j = 1$ (e influence:1 type:1 pair index:86 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative "importance" of web pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords.
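the composite score in eq. (8) above combines precomputed topic-biased rank vectors with the probability of each topic given the query (or its context). a minimal sketch, with dictionaries standing in for the precomputed rank vectors and for the output of a topic classifier (both invented here for illustration):

```python
def topic_sensitive_score(topic_probs, topic_ranks, doc):
    """composite score s_qd = sum_j p(c_j | q') * rank_jd  (eq. (8) style).

    topic_probs: {topic: p(topic | query context)}, assumed to sum to 1.
    topic_ranks: {topic: {doc: precomputed topic-biased rank}}.
    """
    return sum(p * topic_ranks[t].get(doc, 0.0) for t, p in topic_probs.items())

# toy example with two basis topics and two candidate documents
topic_ranks = {"arts": {"d1": 0.04, "d2": 0.01}, "computers": {"d1": 0.005, "d2": 0.03}}
topic_probs = {"arts": 0.9, "computers": 0.1}
ranked = sorted(["d1", "d2"],
                key=lambda d: topic_sensitive_score(topic_probs, topic_ranks, d),
                reverse=True)
print(ranked)  # ['d1', 'd2'] when the query context looks like an arts query
```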
for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:17 citee title:automatic resource compilation by analyzing hyperlink structure and associated text citee abstract:we describe the design, prototyping and evaluation of arc, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. the goal of arc is to compile resource lists similar to those provided by yahoo! or infoseek. the fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while arc operates fully automatically. we describe the evaluation of arc, yahoo!, and infoseek resource lists by a panel of human users. this evaluation suggests that the resources found by arc frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. we also provide examples of arc resource lists for the reader to examine. surrounding text:[4] augments the hits algorithm with content analysis to improve precision for the task of retrieving documents related to a query topic (as opposed to retrieving documents that exactly satisfy the user's information need). [***] makes use of hits for automatically compiling resource lists for general topics. the pagerank algorithm discussed in [7, 16] precomputes a rank vector that provides a-priori \importance" estimates for all of the pages on the web influence:2 type:2 pair index:87 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative \importance" ofweb pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:133 citee title:rank aggregation methods for the web citee abstract:we consider the problem of combining ranking results from various sources. in the context of the web, the main applications include building meta-search engines, combining ranking functions, selecting documents based on multiple criteria, and improving search precision through word associations. we develop a set of techniques for the rank aggregation problem and compare their performance to that of well-known methods. a primary goal of our work is to design rank aggregation techniques that can e ectively combat \spam," a serious problem in web searches. 
experiments show that our methods are simple, efficient, and effective surrounding text:our crawl contained roughly 280,000 of the 3 million urls in the odp. for our experiments, we used 35 of the sample queries given in [***], which were in turn compiled from earlier papers. the queries are listed in table 1 - therefore, we also use a variant of the kendall's τ distance measure. see [***] for a discussion of various distance measures for ranked lists in the context of web search results. for consistency with osim, we will present our definition as a similarity (as opposed to distance) measure, so that values closer to 1 indicate closer agreement influence:2 type:3 pair index:88 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative "importance" of web pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:128 citee title:placing search in context: the concept revisited citee abstract:keyword-based search engines are in widespread use today as a popular means for web-based information retrieval. although such systems seem deceptively simple, a considerable amount of skill is required in order to satisfy non-trivial information needs. this paper presents a new conceptual paradigm for performing search in context, that largely automates the search process, providing even non-professional users with highly relevant results. this paradigm is implemented in practice in the intellizap system, where search is initiated from a text query marked by the user in a document she views, and is guided by the text surrounding the marked query in that document ("the context"). the context-driven information retrieval process involves semantic keyword extraction and clustering to automatically generate new, augmented queries. the latter are submitted to a host of general and domain-specific search engines. search results are then semantically reranked, using context. experimental results testify that using context to guide search effectively offers even inexperienced users an advanced search tool on the web. surrounding text:in the second scenario, we assume the user is viewing a document (for instance, browsing the web or reading email), and selects a term from the document for which he would like more information. this notion of search in context is discussed in [***].
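a kendall-τ-style similarity between two rankings of the same result set, normalized so that 1 means identical orderings (as quoted earlier in this record), could be computed as sketched below. the excerpt does not specify how ties or results appearing in only one list are handled, so this version simply restricts attention to shared items; that choice is an assumption.

```python
from itertools import combinations

def kendall_similarity(rank_a, rank_b):
    """fraction of item pairs ordered the same way by both rankings.

    rank_a, rank_b: lists of items, best first. returns a value in [0, 1];
    1 means identical orderings, 0 means one is the reverse of the other.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    shared = [it for it in rank_a if it in pos_b]
    pairs = list(combinations(shared, 2))
    if not pairs:
        return 1.0
    agree = sum(1 for x, y in pairs
                if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0)
    return agree / len(pairs)

print(kendall_similarity(["a", "b", "c"], ["a", "c", "b"]))  # 2 of 3 pairs agree
```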
for instance, if a query for \architecture" is performed by highlighting a term in a document discussing famous building architects, we would like the result to be di erent than if the query \architecture" is performed by highlighting a term in a document on cpu design influence:3 type:3 pair index:89 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative \importance" ofweb pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:63 citee title:efficient computation of pagerank citee abstract:this paper discusses efficient techniques for computing pagerank, a ranking metric for hypertext documents. we show that pagerank can be computed for very large subgraphs of the web (up to hundreds of millions of nodes) on machines with limited main memory. running-time measurements on various memory configurations are presented for pagerank computation over the 24-million-page stanford webbase archive. we discuss several methods for analyzing the convergence of pagerank based on the induced ordering of the pages. we present convergence results helpful for determining the number of iterations necessary to achieve a useful pagerank assignment, both in the absence and presence of search queries. surrounding text:2. review of pagerank a review of the pagerank algorithm ([16, 7, ***]) follows. the basic idea of pagerank is that if page u has a link to page v, then the author of u is implicitly conferring some importance to page v influence:1 type:1 pair index:90 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative \importance" ofweb pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. 
for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:69 citee title:scaling personalized web search citee abstract:recent web search techniques augment traditional text matching with a global notion of ``importance'' based on the linkage structure of the web, such as in google's pagerank algorithm. for more refined searches, this global notion of importance can be specialized to create personalized views of importance--for example, importance scores can be biased according to a user-specified set of initially-interesting pages. computing and storing all possible personalized views in advance is impractical, as is computing personalized views at query time, since the computation of each view requires an iterative computation over the web graph. we present new graph-theoretical results, and a new technique based on these results, that encode personalized views as partial vectors. partial vectors are shared across multiple personalized views, and their computation and storage costs scale well with the number of views. our approach enables incremental computation, so that the construction of personalized views from partial vectors is practical at query time. we present efficient dynamic programming algorithms for computing partial vectors, an algorithm for constructing personalized views from partial vectors, and experimental results demonstrating the effectiveness and scalability of our techniques. surrounding text:however, a finegrained set of topics leads to efficiency considerations, as the cost of the naive approach to computing these topic-sensitive vectors is linear in the number of basis topics. see [***] for approaches that may make the use of a larger, finer grained set of basis topics practical. we are also currently investigating a di erent approach to creating the damping vector ~p used to create the topicsensitive rank vectors influence:3 type:3 pair index:91 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative \importance" ofweb pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:13 citee title:authoritative sources in a hyperlinked environment citee abstract:the network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have e ective means for understanding it. 
we develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the world wide web. the central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. we propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis. surrounding text:the algorithm proposed in [***] relies on query-time processing to deduce the hubs and authorities that exist in a subgraph of the web consisting of both the results to a query and the local neighborhood of these results. [4] augments the hits algorithm with content analysis to improve precision for the task of retrieving documents related to a query topic (as opposed to retrieving documents that exactly satisfy the user's information need) influence:2 type:1 pair index:92 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative "importance" of web pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:91 citee title:pagerank: bringing order to the web citee abstract:the importance of a web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. but there is still much that can be said objectively about the relative importance of web pages. this paper describes pagerank, a method for rating web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. we compare pagerank to an idealized random web surfer. we show how to efficiently compute pagerank for large surrounding text:[8] makes use of hits for automatically compiling resource lists for general topics. the pagerank algorithm discussed in [7, ***] precomputes a rank vector that provides a-priori "importance" estimates for all of the pages on the web. this vector is computed once, offline, and is independent of the search query - 2. review of pagerank a review of the pagerank algorithm ([***, 7, 11]) follows.
the basic idea of pagerank is that if page u has a link to page v, then the author of u is implicitly conferring some importance to page v influence:1 type:1 pair index:93 citer id:68 citer title:topic-sensitive pagerank citer abstract:in the original pagerank algorithm for improving the ranking of search-query results, a single pagerank vector is computed, using the link structure of the web, to capture the relative "importance" of web pages, independent of any particular search query. to yield more accurate search results, we propose computing a set of pagerank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic. by using these (precomputed) biased pagerank vectors to generate query-specific importance scores for pages at query time, we show that we can generate more accurate rankings than with a single, generic pagerank vector. for ordinary keyword search queries, we compute the topic-sensitive pagerank scores for pages satisfying the query using the topic of the query keywords. for searches done in context (e.g., when the search query is performed by highlighting words in a web page), we compute the topic-sensitive pagerank scores using the topic of the context in which the query appeared citee id:71 citee title:the intelligent surfer: probabilistic combination of link and content information in pagerank citee abstract:the pagerank algorithm, used in the google search engine, greatly improves the results of web search by taking into account the link structure of the web. pagerank assigns to a page a score proportional to the number of times a random surfer would visit that page, if it surfed indefinitely from page to page, following all outlinks from a page with equal probability. we propose to improve pagerank by using a more intelligent surfer, one that is guided by a probabilistic model of the relevance of a page to a query. efficient execution of our algorithm at query time is made possible by precomputing at crawl time (and thus once for all queries) the necessary terms. experiments on two large subsets of the web indicate that our algorithm significantly outperforms pagerank in the (human-rated) quality of the pages returned, while remaining efficient enough to be used in today's large search engines. surrounding text:[17] proposes using the set of web pages that contain some term as a bias set for influencing the pagerank computation, with the goal of returning terms for which a given page has a high reputation. an approach for enhancing search rankings by generating a pagerank vector for each possible query term was recently proposed in [***] with favorable results. however, the approach requires considerable processing time and storage, and is not easily extended to make use of user and query context influence:1 type:2 pair index:94 citer id:70 citer title:optimizing search engines using clickthrough data citer abstract:this paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. while previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. this makes them difficult and expensive to apply.
the goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. such clickthrough data is available in abundance and can be recorded at very low cost. taking a support vector machine (svm) approach, this paper presents a method for learning retrieval functions. from a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. furthermore, it is shown to be feasible even for large sets of queries and features. the theoretical results are verified in a controlled experiment. it shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming google in terms of retrieval quality after only a couple of hundred training examples citee id:8 citee title:an efficient boosting algorithm for combining preferences citee abstract:we study the problem of learning to accurately rank a set of objects by combining a given collection of ranking or preference functions. this problem of combining preferences arises in several applications, such as that of combining the results of different search engines, or the collaborative-filtering problem of ranking movies for a user based on the movie rankings provided by other users. in this work, we begin by presenting a formal framework for this general problem. we then describe and analyze an efficient algorithm called rankboost for combining preferences based on the boosting approach to machine learning. we give theoretical results describing the algorithm's behavior both on the training data, and on new test data not seen during training. we also describe an efficient implementation of the algorithm for a particular restricted but common case. we next discuss two experiments we carried out to assess the performance of rankboost. in the first experiment, we used the algorithm to combine different web search strategies, each of which is a query expansion for a given domain. the second experiment is a collaborative-filtering task for making movie recommendations. surrounding text:the boosting algorithm of freund et al. [***] is an approach to combining many weak ranking rules into a strong ranking function. while they also (approximately) minimize the number of inversions, they do not explicitly consider a distribution over queries and target rankings influence:2 type:2 pair index:95 citer id:70 citer title:optimizing search engines using clickthrough data citer abstract:this paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. while previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. this makes them difficult and expensive to apply. the goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. such clickthrough data is available in abundance and can be recorded at very low cost. taking a support vector machine (svm) approach, this paper presents a method for learning retrieval functions.
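a minimal sketch of the pairwise idea behind learning a retrieval function from clicks, as discussed in the clickthrough records above: a clicked document is treated as preferred over documents ranked above it but skipped, and each preference becomes a difference of feature vectors for a linear classifier. the click-log format, the feature vectors, and the use of scikit-learn's LinearSVC are assumptions for illustration, not the paper's implementation.

```python
# Sketch: turn click-log preferences into training pairs for a linear
# ranking function. The feature vectors and the log below are made up.
import numpy as np
from sklearn.svm import LinearSVC

# features[query][doc] -> feature vector describing the (query, doc) match
features = {
    "q1": {"d1": np.array([0.2, 1.0]),
           "d2": np.array([0.9, 0.1]),
           "d3": np.array([0.4, 0.3])},
}
ranking = {"q1": ["d1", "d2", "d3"]}   # presented order
clicks = {"q1": {"d2"}}                # documents the user clicked

X, y = [], []
for q, ranked in ranking.items():
    for i, clicked_doc in enumerate(ranked):
        if clicked_doc not in clicks[q]:
            continue
        for skipped_doc in ranked[:i]:          # ranked above but not clicked
            if skipped_doc in clicks[q]:
                continue
            diff = features[q][clicked_doc] - features[q][skipped_doc]
            X.append(diff)                      # clicked preferred over skipped
            y.append(1)
            X.append(-diff)                     # mirrored pair balances the classes
            y.append(-1)

w = LinearSVC(C=1.0).fit(np.array(X), np.array(y)).coef_[0]
print("learned weight vector:", w)              # score(d) = w . features(q, d)
```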
from a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. furthermore, it is shown to be feasible even for large sets of queries and features. the theoretical results are verified in a controlled experiment. it shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming google in terms of retrieval quality after only a couple of hundred training examples citee id:125 citee title:webwatcher: a tour guide for the world wide web citee abstract:we explore the notion of a tour guide software agent for assisting users browsing the world wide web. a web tour guide agent provides assistance similar to that provided by a human tour guide in a museum-- it guides the user along an appropriate path through the collection, based on its knowledge of the user's interests, of the location and relevance of various items in the collection, and of the way in which others have interacted with the collection in the past. this paper describes a simple but operational tour guide, called webwatcher, which has given over 5000 tours to people browsing cmu's school of computer science web pages. webwatcher accompanies users from page to page, suggests appropriate hyperlinks, and learns from experience to improve its advice-giving skills. we describe the learning algorithms used by webwatcher, experimental results showing their effectiveness, and lessons learned from this case study in web tour guide agents surrounding text:an interesting question is whether such an online algorithm can also be used to solve the optimization problem connected to the ranking svm. some attempts have been made to use implicit feedback by observing clicking behavior in retrieval systems [5] and browsing assistants [***][20]. however, the semantics of the learning process and its results are unclear as demonstrated in section 2 influence:2 type:2 pair index:96 citer id:70 citer title:optimizing search engines using clickthrough data citer abstract:this paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. while previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. this makes them difficult and expensive to apply. the goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. such clickthrough data is available in abundance and can be recorded at very low cost. taking a support vector machine (svm) approach, this paper presents a method for learning retrieval functions. from a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. furthermore, it is shown to be feasible even for large sets of queries and features. the theoretical results are verified in a controlled experiment. 
it shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming google in terms of retrieval quality after only a couple of hundred training examples citee id:113 citee title:letizia: an agent that assists web browsing citee abstract:letizia is a user interface agent that assists a user browsing the world wide web. as the user operates a conventional web browser such as netscape, the agent tracks user behavior and attempts to anticipate items of interest by doing concurrent, autonomous exploration of links from the user's current position. the agent automates a browsing strategy consisting of a best-first search augmented by heuristics inferring user interest from browsing behavior. 1 introduction "letizia álvarez de surrounding text:an interesting question is whether such an online algorithm can also be used to solve the optimization problem connected to the ranking svm. some attempts have been made to use implicit feedback by observing clicking behavior in retrieval systems [5] and browsing assistants [17][***]. however, the semantics of the learning process and its results are unclear as demonstrated in section 2 influence:2 type:2 pair index:97 citer id:70 citer title:optimizing search engines using clickthrough data citer abstract:this paper presents an approach to automatically optimizing the retrieval quality of search engines using clickthrough data. intuitively, a good information retrieval system should present relevant documents high in the ranking, with less relevant documents following below. while previous approaches to learning retrieval functions from examples exist, they typically require training data generated from relevance judgments by experts. this makes them difficult and expensive to apply. the goal of this paper is to develop a method that utilizes clickthrough data for training, namely the query-log of the search engine in connection with the log of links the users clicked on in the presented ranking. such clickthrough data is available in abundance and can be recorded at very low cost. taking a support vector machine (svm) approach, this paper presents a method for learning retrieval functions. from a theoretical perspective, this method is shown to be well-founded in a risk minimization framework. furthermore, it is shown to be feasible even for large sets of queries and features. the theoretical results are verified in a controlled experiment. it shows that the method can effectively adapt the retrieval function of a meta-search engine to a particular group of users, outperforming google in terms of retrieval quality after only a couple of hundred training examples citee id:126 citee title:term weighting approaches in automatic text retrieval citee abstract:the experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. these results depend crucially on the choice of effective term weighting systems. this paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared surrounding text:
table 1: average clickrank for three retrieval functions (bxx, tfc [***], and a hand-tuned strategy that uses different weights according to html tags) implemented in laser. rows correspond to the retrieval method used by laser at query time; columns hold values from subsequent evaluation with other methods influence:3 type:3 pair index:98 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them.
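the tfc weighting named in the table caption above is commonly described as tf-idf with cosine normalisation per document; a small sketch under that assumption (toy collection, natural-log idf) follows.

```python
# Sketch of tfc-style weighting: tf * idf, cosine-normalised per document.
# The toy collection and the natural-log idf are illustrative choices.
import math
from collections import Counter

docs = ["web search ranking", "link analysis for web pages", "ranking web pages"]
tokenised = [d.split() for d in docs]
n_docs = len(docs)
# document frequency: in how many documents each term appears
df = Counter(term for doc in tokenised for term in set(doc))

def tfc_weights(doc_terms):
    tf = Counter(doc_terms)
    w = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}

for doc, terms in zip(docs, tokenised):
    print(doc, "->", tfc_weights(terms))
```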
the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:17 citee title:automatic resource compilation by analyzing hyperlink structure and associated text citee abstract:we describe the design, prototyping and evaluation of arc, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. the goal of arc is to compile resource lists similar to those provided by yahoo! or infoseek. the fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while arc operates fully automatically. we describe the evaluation of arc, yahoo!, and infoseek resource lists by a panel of human users. this evaluation suggests that the resources found by arc frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. we also provide examples of arc resource lists for the reader to examine. surrounding text:the last problem is by far the most common, and our general solution is to use content analysis to help keep the connectivity-based computation "on the topic." we compare the performance of 10 algorithms with the basic kleinberg algorithm on 28 topics that were used previously in [***]. the best approach increases the precision over basic kleinberg by at least 45% and takes less than 3 minutes - we used relative recall instead of recall since we do not know the number of relevant documents for a topic on the web, or even in the neighborhood graph. we used a set of 28 queries previously used by [***] in comparing the rankings from their version of kleinberg's algorithm with category listings on the web. table 1 gives a listing of the queries ordered by the number of results returned by altavista in december 1997 for each query, which can be taken as a measure of the topic's popularity on the web - in terms of relative recall, compared with the best previous algorithm, selective pruning performed comparably for authority documents, and about 10% worse for hub documents. 7 related work the arc algorithm of chakrabarti et al. [***] also extends kleinberg's algorithm with textual analysis. arc computes a distance-2 neighborhood graph and weights edges influence:1 type:2 pair index:100 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them.
the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:88 citee title:search engines for the world wide web: a comparative study and evaluation methodology citee abstract:three web search engines, namely, alta vista, excite, and lycos, were compared and evaluated in terms of their search capabilities (e.g., boolean logic, truncation, field search, word and phrase search) and retrieval performances (i.e., precision and response time) using sample queries drawn from real reference questions. recall, the other evaluation criterion of information retrieval, is deliberately omitted from this study because it is impossible to assume how many relevant items there are for a particular query in the huge and ever changing web system. the authors of this study found that alta vista outperformed excite and lycos in both search facilities and retrieval performance although lycos had the largest coverage of web resources among the three web search engines examined. as a result of this research, we also proposed a methodology for evaluating other web search engines not included in the current study. surrounding text:e.g., search service comparisons [***, 11]). 8 conclusions in this paper we showed that kleinberg's connectivity analysis has three problems influence:3 type:3 pair index:101 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:88 citee title:search engines for the world wide web: a comparative study and evaluation methodology citee abstract:three web search engines, namely, alta vista, excite, and lycos, were compared and evaluated in terms of their search capabilities (e.g., boolean logic, truncation, field search, word and phrase search) and retrieval performances (i.e., precision and response time) using sample queries drawn from real reference questions. recall, the other evaluation criterion of information retrieval, is deliberately omitted from this study because it is impossible to assume how many relevant items there are for a particular query in the huge and ever changing web system. the authors of this study found that alta vista outperformed excite and lycos in both search facilities and retrieval performance although lycos had the largest coverage of web resources among the three web search engines examined. as a result of this research, we also proposed a methodology for evaluating other web search engines not included in the current study. surrounding text:e.g., search service comparisons [7, ***]).
8 conclusions in this paper we showed that kleinberg's connectivity analysis has three problems influence:3 type:3 pair index:102 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:89 citee title:towards interactive query expansion citee abstract:in an era of online retrieval, it is appropriate to offer guidance to users wishing to improve their initial queries. one form of such guidance could be short lists of suggested terms gathered from feedback, nearest neighbors, and term variants of original query terms. to verify this approach, a series of experiments were run using the cranfield test collection to discover techniques to select terms for these lists that would be effective for further retrieval. the results show that significant improvement can be expected from this approach to query expansion surrounding text:e.g., [24, ***]). on the web there are examples of topic hierarchies (e influence:3 type:2 pair index:103 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:13 citee title:authoritative sources in a hyperlinked environment citee abstract:the network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. we develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the world wide web. the central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. we propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
surrounding text:if a is seen to point to a lot of good documents, then a's opinion becomes more valuable, and the fact that a points to b would suggest that b is a good document as well. using this basic idea, kleinberg [***] developed a connectivity analysis algorithm for hyperlinked environments. given an initial set of results from a search service, the algorithm extracts a subgraph from the web containing the result set and its neighboring documents - 2 connectivity analysis the goal of connectivity analysis is to exploit linkage information between documents, based on the assumption that a link between two documents implies that the documents contain related content (assumption i), and that if the documents were authored by different people then the first author found the second document valuable (assumption ii). in 1997 kleinberg [***] published an algorithm for connectivity analysis on the world wide web which we describe next. 2 - (4) while the vectors h and a have not converged: (5) for all nodes n, a[n] := Σ over edges (n',n) of h[n'] (6) for all nodes n, h[n] := Σ over edges (n,n') of a[n'] (7) normalize the h and a vectors. kleinberg [***] proved that the h and a vectors will eventually converge, i.e. - note that the algorithm does not claim to find all relevant pages, since there may be some that have good content but have not been linked to by many authors. in our evaluation of different algorithms we use kleinberg's algorithm [***] as our baseline, which we call base. 2 - one way of obtaining inlinks is to use altavista queries of the form link:u, which returns a list of documents that point to the url u. this was the implementation used by [***]. in our queries, the neighborhood graph contained on the order of 2000 nodes influence:2 type:2 pair index:104 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:20 citee title:bibliometrics of the world wide web: an exploratory analysis of the intellectual structure of cyberspace citee abstract:this exploratory study examines the explosive growth and the "bibliometrics" of the world wide web based on both analysis of over 30 gigabytes of web pages collected by the inktomi "web crawler" and on the use of the dec altavista search engine for cocitation analysis of a set of earth science related www sites. the statistical characteristics of web documents and their hypertext links are examined, along with examination of the characteristics of highly cited web documents surrounding text:this is used to discover influential publications and authors with similar interests within the articles of a certain field of study. see [***] for a discussion on applying bibliometrics to the world wide web.
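the hub/authority iteration quoted in steps (4)-(7) of the surrounding text above can be written out directly; the toy edge list and the convergence tolerance below are assumptions for illustration, not values from the cited work.

```python
# Sketch of the hub/authority iteration on a small neighbourhood graph.
# Edges (p, q) mean "page p links to page q"; graph and tolerance are toy values.
import math

edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("d", "a")]
nodes = {n for e in edges for n in e}
hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(100):
    # a[n] := sum of h[n'] over edges (n', n); h[n] := sum of a[n'] over edges (n, n')
    new_auth = {n: sum(hub[p] for p, q in edges if q == n) for n in nodes}
    new_hub = {n: sum(new_auth[q] for p, q in edges if p == n) for n in nodes}
    # normalise both vectors to unit length
    na = math.sqrt(sum(v * v for v in new_auth.values()))
    nh = math.sqrt(sum(v * v for v in new_hub.values()))
    new_auth = {n: v / na for n, v in new_auth.items()}
    new_hub = {n: v / nh for n, v in new_hub.items()}
    converged = max(abs(new_auth[n] - auth[n]) for n in nodes) < 1e-9
    auth, hub = new_auth, new_hub
    if converged:
        break

print(sorted(auth.items(), key=lambda kv: -kv[1]))  # top authorities
print(sorted(hub.items(), key=lambda kv: -kv[1]))   # top hubs
```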
citation analysis has been criticized (see [8]) as a source of systematic bias, since members of cliquish communities tend to [sentence truncated; a results-table fragment is omitted here: authorities at 5 for the variants base, imp, med, startmed, maxby10, impr, medr, startmedr, maxby10r, pca0, and pca1, with and without regulation] influence:2 type:2 pair index:105 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:90 citee title:interfaces for end-user information seeking citee abstract:discusses essential features of interfaces to support end-user information seeking. highlights include cognitive engineering; task models and task analysis; the problem-solving nature of information seeking; examples of systems for end-users, including online public access catalogs (opacs), hypertext, and help systems; and suggested research directions surrounding text:1 introduction search services on the world wide web are the information retrieval systems that most people are familiar with. as argued by marchionini [***], "end users want to achieve their goals with a minimum of cognitive load and a maximum of enjoyment." correspondingly, in the context of web searches we observe that users tend to type short queries (one to three words) [2, 9], without giving much thought to query formulation influence:3 type:3 pair index:106 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:91 citee title:pagerank: bringing order to the web citee abstract:the importance of a web page is an inherently subjective matter, which depends on the reader's interests, knowledge and attitudes. but there is still much that can be said objectively about the relative importance of web pages. this paper describes pagerank, a method for rating web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. we compare pagerank to an idealized random web surfer. we show how to efficiently compute pagerank for large surrounding text:pirolli et al. [26] run a computation on an inter-document matrix, with weights derived from linkage, content similarity and usage data, to identify usable structures.
pagerank [***] is a ranking algorithm for web documents that uses connectivity to compute a topic-independent score for each document. there has been much work in ir on supporting topic exploration influence:2 type:2 pair index:107 citer id:86 citer title:improved algorithms for topic distillation in a hyperlinked environment citer abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis citee id:92 citee title:silk from a sow's ear: extracting usable structures from the web citee abstract:in its current implementation, the world-wide web lacks much of the explicit structure and strong typing found in many closed hypertext systems. while this property has directly fueled the explosive acceptance of the web, it further complicates the already difficult problem of identifying usable structures and aggregates in large hypertext collections. these reduced structures, or localities, form the basis to simplifying visualizations of and navigation through complex hypertext systems. much surrounding text:others have used inter-document linkage to compute useful data on the web as well. pirolli et al. [***] run a computation on an inter-document matrix, with weights derived from linkage, content similarity and usage data, to identify usable structures. pagerank [25] is a ranking algorithm for web documents that uses connectivity to compute a topic-independent score for each document influence:2 type:2 pair index:108 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models.
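the spatial and content features listed in the block-importance abstract above can be packed into a plain feature vector per block; the field names and the toy block record below are assumptions for illustration, not the paper's exact feature set.

```python
# Sketch: turn one segmented block into a feature vector of spatial features
# (position, size) and content features (counts of images, links, words).
# The block record and the chosen features are illustrative assumptions.

def block_features(block, page_width, page_height):
    x, y, w, h = block["x"], block["y"], block["width"], block["height"]
    return [
        x / page_width,                          # relative horizontal position
        y / page_height,                         # relative vertical position
        (w * h) / (page_width * page_height),    # relative block size
        block["num_images"],
        block["num_links"],
        len(block["text"].split()),              # rough word count
    ]

block = {"x": 100, "y": 250, "width": 600, "height": 300,
         "num_images": 1, "num_links": 4,
         "text": "latest headlines and a short summary of the top story"}
print(block_features(block, page_width=1024, page_height=2000))
```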
in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:107 citee title:template detection via data mining and its applications citee abstract:we formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. we show that the use of templates is pervasive on the web. we describe three principles, which characterize the assumptions made by hypertext information retrieval (ir) and data mining (dm) systems, and show that templates are a major source of violation of these principles. as a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall surrounding text:the common idea of these works is that in a given web site, noisy blocks usually share some common contents and presentation styles [15]. bar-yossef et al. define the common parts among web pages as template [***]. when web pages are partitioned into some pagelets based on some rules, the problem of template detection is transformed to identify duplicated pagelets and count frequency - here we show several promising applications that may take advantage of the block importance model. the block importance model is motivated by the urge to improve information retrieval performance, thus its direct application lies in this area [***][10]. a search engine may benefit from block importance in two aspects: one is to improve the overall relevance rank of the returned pages after searching; the other is to locate the important regions carrying more information during searching - 9. references [***] bar-yossef, z. and rajagopalan, s influence:3 type:2 pair index:109 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:108 citee title:the anatomy of a large-scale hypertextual web search engine citee abstract:in this paper, we present google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext.
google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. the prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. to engineer a search engine is a challenging task. search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. they answer tens of millions of queries every day. despite the importance of large-scale search engines on the web, very little academic research has been done on them. furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. this paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. this paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want surrounding text:within a single web page, it is also important to distinguish valuable information from noisy content that may mislead users' attention. the former has been well studied by link analysis techniques such as pagerank [***]; however, to date, there is no effective technique for the latter aspect. most applications consider the whole web page as an atomic unit and treat different portions in a web page equally influence:3 type:3 pair index:110 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:80 citee title:function-based object model towards website adaptation citee abstract:content understanding is a crucial issue for website adaptation.
in this paper we present a function-based object model (fom) that attempts to understand authors' intention by identifying object function instead of semantic understanding. every object in a website serves for certain functions (basic and specific function) which reflect authors' intention towards the purpose of an object. based on this consideration we have proposed the fom model for website understanding. fom includes two complementary parts: basic fom based on the basic functional properties of object and specific fom based on the category of object. an automatic approach to detect the functional properties and category of object is presented for fom generation. two level adaptation rules (general rules and specific rules) based on fom are combined for practical adaptation. a system for web content adaptation over wireless application protocol (wap) is developed as an application example of the proposed model. experiments have shown satisfactory results and extensibility surrounding text:there are several kinds of methods for web page segmentation. most popular ones are dom-based segmentation [***], location-based segmentation [9] and vision-based page segmentation (vips) [17][3]. these methods distinguish each other by considering various factors as the partition basis - [9] used visual information to build up an m-tree, and further defined heuristics to recognize common page areas such as header, left and right menu, footer and center of a page. in [***], a function model called fom is used to represent the relationships between features and functions. this work is close to our work, but it is rule-based and cannot deal with dozens of features with complicated correlations - 3. page segmentation several methods have been explored to segment a web page into regions or blocks [***][10]. in the dom-based segmentation approach, an html document is represented as a dom tree - there are also some works addressing the problem of block function identification. in [***], an automatic rule-based approach is presented to detect the functional property and category of object. however, this method is unstable and it is very difficult to manually compose rules as functions of dozens of features influence:2 type:2 pair index:111 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models.
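a minimal illustration of the dom-based segmentation idea mentioned in the surrounding text above: selected container tags in the html dom are treated as candidate blocks and the text under each is collected. the tag list and the toy page are assumptions; vips-style visual segmentation additionally needs rendering information that this sketch does not use.

```python
# Sketch of DOM-based segmentation: top-level <div>/<table>/<p> elements are
# treated as candidate blocks. The tag choice and the toy page are assumptions.
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "table", "p"}

class BlockSplitter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside the current block
        self.blocks = []        # collected text, one entry per block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0:
                self.blocks.append("")
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.blocks[-1] += data.strip() + " "

page = """<html><body>
  <div>site navigation <a href="/a">home</a></div>
  <div><p>main article text about topic distillation.</p></div>
  <table><tr><td>copyright notice</td></tr></table>
</body></html>"""

splitter = BlockSplitter()
splitter.feed(page)
print(splitter.blocks)
```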
in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:81 citee title:geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition citee abstract:this paper develops the separating capacities of families of nonlinear decision surfaces by a direct application of a theorem in classical combinatorial geometry. it is shown that a family of surfaces having d degrees of freedom has a natural separating capacity of 2d pattern vectors, thus extending and unifying results of winder and others on the pattern-separating capacity of hyperplanes. applying these ideas to the vertices of a binary n-cube yields bounds on the number of spherically, quadratically, and, in general, nonlinearly separable boolean functions of n variables surrounding text:the output layer is linear, supplying the block importance given the low-level block representation applied to the input layer. a mathematical justification for the rationale of a nonlinear transformation followed by a linear transformation can be found in [***]. the function learned by rbf networks can be represented by f_i(x) = Σ_{j=1..h} w_ij g_j(x), where h is the number of hidden layer neurons, w_ij - in other words, features in important blocks will be chosen or have higher weights than features in unimportant blocks. there are also a few works beginning to explore this interesting topic [***][15][16]. block importance can also be applied to facilitate the web adaptation applications driven by the proliferation of small mobile devices [8] influence:3 type:3,1 pair index:112 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:73 citee title:error-correcting output codes: a general method for improving multiclass inductive learning programs citee abstract:multiclass learning problems involve finding a definition for an unknown function f(x) whose range is a discrete set containing k > 2 values (i.e., k "classes"). the definition is acquired by studying large collections of training examples of the form (xi, f(xi)).
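the rbf-network form reconstructed in the surrounding text above, f_i(x) = Σ_{j=1..h} w_ij g_j(x), can be evaluated directly once centres, widths, and output weights are fixed; all numbers below are made up for illustration and are not trained values from the cited work.

```python
# Sketch: evaluate an RBF network f_i(x) = sum_j w_ij * g_j(x) with Gaussian
# hidden units. Centres, widths, and weights are illustrative, not trained values.
import numpy as np

centres = np.array([[0.1, 0.2, 0.0], [0.8, 0.5, 1.0]])   # h = 2 hidden units
widths = np.array([0.5, 0.7])
W = np.array([[0.3, 1.2],                                 # one row of weights w_ij
              [0.9, -0.4]])                               # per output unit i

def rbf_output(x):
    # Gaussian hidden activations g_j(x)
    g = np.exp(-np.sum((x - centres) ** 2, axis=1) / (2 * widths ** 2))
    return W @ g                                          # linear output layer

x = np.array([0.2, 0.3, 0.5])      # a hypothetical block feature vector
print(rbf_output(x))               # one importance score per output unit
```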
existing approaches to this problem include (a) direct application of multiclass algorithms such as the decision-tree algorithms id3 and cart, (b) application of binary concept learning algorithms to learn individual binary functions for each of the k classes, and (c) application of binary concept learning algorithms with distributed output codes such as those employed by sejnowski and rosenberg in the nettalk system. this paper compares these three approaches to a new technique in which bch error-correcting codes are employed as a distributed output representation. we show that these output representations improve the performance of id3 on the nettalk task and of backpropagation on an isolated-letter speech-recognition task. these results demonstrate that error-correcting output codes provide a general-purpose method for improving the performance of inductive learning programs on multiclass problems. surrounding text:some typical kernel functions include polynomial kernel, gaussian rbf kernel, and sigmoid kernel. for the multi-class classification problem, one can simply apply the one-against-all scheme [6][***][12]. we use both the linear svm and nonlinear svm with gaussian rbf kernel to learn the block importance models in our experiments influence:2 type:3,1 pair index:113 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:50 citee title:discovering informative content blocks from web documents citee abstract:in this paper, we propose a new approach to discover informative contents from a set of tabular documents (or web pages) of a web site. our system, infodiscoverer, first partitions a page into several content blocks according to html tag in a web page. based on the occurrence of the features (terms) in the set of pages, it calculates entropy value of each feature. according to the entropy value of each feature in a content block, the entropy value of the block is defined. by analyzing. surrounding text:their experimental results show that template elimination improves the precision of the search engine clever at all levels of recall. another content-based approach is proposed by lin and ho [***].
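a small sketch of the one-against-all scheme with a gaussian-rbf svm, in the spirit of the multi-class block-importance setup described in the surrounding text above; the use of scikit-learn and the toy feature vectors and labels are assumptions for illustration, not the paper's code.

```python
# Sketch: multi-class block-importance classification with an RBF-kernel SVM
# wrapped in one-vs-rest. Feature vectors and labels below are toy data.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# each row: a few block features (relative x, relative y, relative size, #links)
X = np.array([[0.10, 0.05, 0.10, 12],   # banner-like block
              [0.30, 0.40, 0.45, 3],    # main content block
              [0.10, 0.90, 0.05, 8],    # footer-like block
              [0.35, 0.35, 0.50, 2],
              [0.05, 0.02, 0.12, 15],
              [0.10, 0.85, 0.04, 9]])
y = np.array([1, 4, 2, 4, 1, 2])        # importance levels, e.g. 1 (low) .. 4 (high)

clf = OneVsRestClassifier(SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict([[0.32, 0.38, 0.48, 2]]))   # likely the high-importance class
```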
their system, infodiscover, partitions a web page into several content blocks according to tags - 3. page segmentation several methods have been explored to segment a web page into regions or blocks [4][***]. in the dom-based segmentation approach, an html document is represented as a dom tree - here we show several promising applications that may take advantage of the block importance model. the block importance model is motivated by the urge to improve information retrieval performance, thus its direct application lies in this area [1][***]. a search engine may benefit from block importance in two aspects: one is to improve the overall relevance rank of the returned pages after searching; the other is to locate the important regions carrying more information during searching influence:2 type:2 pair index:114 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:15 citee title:automatic browsing of large pictures on mobile devices citee abstract:pictures have become increasingly common and popular in mobile communications. however, due to the limitation of mobile devices, there is a need to develop new technologies to facilitate the browsing of large pictures on the small screen. in this paper, we propose a novel approach which is able to automate the scrolling and navigation of a large picture with a minimal amount of user interaction on mobile devices. an image attention model is employed to illustrate the information structure within an image. an optimal image browsing path is then calculated based on the image attention model to simulate the human browsing behaviors. experimental evaluations of the proposed mechanism indicate that our approach is an effective way for viewing large images on small displays. surrounding text:attention is a neurobiological conception. it means the concentration of mental powers on an object, a close or careful observing or listening [***].
at the first sight of a web page, attention may be caught by an image with bright color or animations in advertisements, but generally such an object is not the important part of the page influence:3 type:3 pair index:115 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a person's citee id:109 citee title:web page cleaning for web mining through feature weighting citee abstract:unlike conventional data or text, web pages typically contain a large amount of information that is not part of the main contents of the pages, e.g., banner ads, navigation bars, and copyright notices. such irrelevant information (which we call web page noise) in web pages can seriously harm web mining, e.g., clustering and classification. in this paper, we propose a novel feature weighting technique to deal with web page noise to enhance web mining. this method first builds a compressed structure tree to capture the common structure and comparable blocks in a set of web pages. it then uses an information based measure to evaluate the importance of each node in the compressed structure tree. based on the tree and its node importance values, our method assigns a weight to each word feature in its content block. the resulting weights are used in web mining. we evaluated the proposed technique with two web mining tasks, web page clustering and web page classification. experimental results show that our weighting method is able to dramatically improve the mining results. surrounding text:one class of techniques aims to detect the patterns among a number of web pages from the same web site. the common idea of these works is that in a given web site, noisy blocks usually share some common contents and presentation styles [***]. bar-yossef et al. define the common parts among web pages as template [1]
a style tree is defined to represent both layout and content of a web page - for example, in pseudo-relevance feedback technique, the expansion terms will be selected from the region with high importance and block importance could also be combined into term weights. another application of block importance is on web page classification [9][***][16]. for most of the existing techniques, features used for classification are selected from the whole page - in other words, features in important blocks will be chosen or have higher weights than features in unimportant blocks. there also are a few works beginning to explore this interesting topic [5] [***][16]. block importance can also be applied to facilitate the web adaptation applications driven by the proliferation of small mobile devices [8] influence:1 type:2 pair index:116 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a persons citee id:72 citee title:eliminating noisy information in web pages for data mining citee abstract:a commercial web page typically contains many information blocks. apart from the main content blocks, it usually has such blocks as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). we call these blocks that are not the main content blocks of the page the noisy blocks. we show that the information contained in these noisy blocks can seriously harm web data mining. eliminating these noises is thus of great importance. in this paper, we propose a noise elimination technique based on the following observation: in a given web site, noisy blocks usually share some common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. based on this observation, we propose a tree structure, called style tree, to capture the common presentation styles and the actual contents of the pages in a given web site. by sampling the pages of the site, a style tree can be built for the site, which we call the site style tree (sst). we then introduce an information based measure to determine which parts of the sst represent noises and which parts represent the main contents of the site. 
the sst is employed to detect and eliminate noises in any web page of the site by mapping this page to the sst. the proposed technique is evaluated with two data mining tasks, web page clustering and classification. experimental results show that our noise elimination technique is able to improve the mining results significantly surrounding text:an entropy-threshold is selected to decide whether a block is informative or redundant. different from these two works, yi and liu make use of the common presentation style [15][***]. a style tree is defined to represent both layout and content of a web page - for example, in pseudo-relevance feedback technique, the expansion terms will be selected from the region with high importance and block importance could also be combined into term weights. another application of block importance is on web page classification [9][15][***]. for most of the existing techniques, features used for classification are selected from the whole page - in other words, features in important blocks will be chosen or have higher weights than features in unimportant blocks. there also are a few works beginning to explore this interesting topic [5] [15][***]. block importance can also be applied to facilitate the web adaptation applications driven by the proliferation of small mobile devices [8] influence:2 type:2 pair index:117 citer id:106 citer title:learning block importance models for web pages citer abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a persons citee id:93 citee title:improving pseudo-relevance feedback in web information retrieval using web page segmentation citee abstract:in contrast to traditional document retrieval, a web page as a whole is not a good information unit to search because it often contains multiple topics and a lot of irrelevant information from navigation, decoration, and interaction part of the page. in this paper, we propose a vision-based page segmentation (vips) algorithm to detect the semantic content structure in a web page. compared with simple dom based segmentation method, our page segmentation scheme utilizes useful visual cues to obtain a better partition of a page at the semantic level. 
by using our vips algorithm to assist the selection of query expansion terms in pseudo-relevance feedback in web information retrieval, we achieve a 27% performance improvement on the web track dataset surrounding text:there are several kinds of methods for web page segmentation. the most popular ones are dom-based segmentation [4], location-based segmentation [9] and vision-based page segmentation (vips) [***][3]. these methods are distinguished from each other by the factors they consider as the partition basis influence:2 type:2 pair index:118 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:86 citee title:improved algorithms for topic distillation in a hyperlinked environment citee abstract:this paper addresses the problem of topic distillation on the world wide web, namely, given a typical user query to find quality documents related to the query topic. connectivity analysis has been shown to be useful in identifying high quality pages within a topic specific graph of hyperlinked documents. the essence of our approach is to augment a previous connectivity analysis based algorithm with content analysis. we identify three problems with the existing approach and devise algorithms to tackle them. the results of a user evaluation are reported that show an improvement of precision at 10 documents by at least 45% over pure connectivity analysis surrounding text:with such huge volume and great variation in contents, finding useful information effectively from the web becomes a very challenging job. traditional keyword based text search engines cannot provide satisfying results to web queries since: (1) users tend to submit very short, sometimes ambiguous queries and they are reluctant to provide feedback information [***]. (2) the quality of web pages varies greatly [6], and users usually prefer high quality pages over low quality pages in the result set returned by the search engine - there are two issues that need to be considered in eq. (3): first, as noted by bharat and henzinger [***], mutually reinforcing relationships between objects may give undue weight to objects.
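the pseudo-relevance feedback idea mentioned in the surrounding texts above (select expansion terms from high-importance blocks and fold block importance into term weights) might look like the following sketch; the input layout (each feedback page given as a list of (block importance, term list) pairs) and the simple importance-times-frequency score are assumptions for illustration, not the cited method itself.

# Sketch: weight candidate expansion terms by the importance of the blocks they
# occur in, instead of treating the whole page as one bag of words.
from collections import Counter, defaultdict

def weighted_expansion_terms(feedback_pages, top_k=10):
    """feedback_pages: list of pages, each a list of (block_importance, term_list)."""
    scores = defaultdict(float)
    for page in feedback_pages:
        for importance, terms in page:
            tf = Counter(terms)
            for term, freq in tf.items():
                scores[term] += importance * freq   # block importance scales the term weight
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# usage:
# pages = [[(0.9, ["privacy", "mining"]), (0.1, ["copyright", "login"])]]
# print(weighted_expansion_terms(pages, top_k=3))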
ideally, we would like all the objects to have the same influence on the other objects they connect to - then, the root set is expanded to the base set by its neighborhoods, which are the web pages that either point to or are pointed at by pages in the root set. in this experiment, we set the maximum in-degree of nodes as 50, which is commonly adopted by the previous works [***, 17]. the expanded set of web pages forms the data objects in hub space and authority space influence:2 type:3 pair index:119 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:108 citee title:the anatomy of a large-scale hypertextual web search engine citee abstract:in this paper, we present google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. the prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ to engineer a search engine is a challenging task. search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. they answer tens of millions of queries every day. despite the importance of large-scale search engines on the web, very little academic research has been done on them. furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. this paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. this paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want surrounding text:most existing web related research fits into the web multi-space model we described in figure 1. 
for example, web search [***, 17] uses the web page space and hyperlinks within the space; collaborative filtering [14] uses the document (web page) space, the user space, and the browsing relationship in-between; web query clustering [22] uses the web page space, query space, and reference relationship in-between. unfortunately, most of these works only consider one type of link/relationship when analyzing the links/relationships of objects, and they can be classified into intra-type link analysis and inter-type link analysis regarding the type of links they use - in intra-type link analysis, the attribute of a data object is directly reinforced by the same attribute of other data objects in the same data space. for example, in google's pagerank algorithm [***], the popularity attributes of web pages reinforce each other via the hyper-link structure within them. in inter-type link analysis, the attribute of one type of data objects is reinforced by attributes of data objects from other data spaces - it is easy to see that w is the principal eigenvector of a. following the same rationale, brin and page [***] designed the pagerank algorithm to calculate the importance of web pages in the web. in addition to pinski and narin's algorithm, pagerank simulates a web surfer's behavior on the web influence:1 type:1,2 pair index:120 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:17 citee title:automatic resource compilation by analyzing hyperlink structure and associated text citee abstract:we describe the design, prototyping and evaluation of arc, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. the goal of arc is to compile resource lists similar to those provided by yahoo! or infoseek. the fundamental difference is that these services construct lists either manually or through a combination of human and automated effort, while arc operates fully automatically. we describe the evaluation of arc, yahoo!, and infoseek resource lists by a panel of human users. this evaluation suggests that the resources found by arc frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. we also provide examples of arc resource lists for the reader to examine.
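a minimal power-iteration sketch of the pagerank computation discussed above (importance as the principal eigenvector of a normalized link matrix, with a random-surfer damping term); the damping factor 0.85, the uniform handling of dangling pages and the convergence tolerance are conventional choices, not values taken from the cited papers.

import numpy as np

def pagerank(adj, damping=0.85, tol=1e-8, max_iter=100):
    """adj[i, j] = 1 if page i links to page j; returns an importance score per page."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # column-stochastic transition matrix; dangling pages jump uniformly to all pages
    M = np.where(out_deg[:, None] > 0, adj / np.maximum(out_deg[:, None], 1), 1.0 / n).T
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - damping) / n + damping * M @ r
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r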
surrounding text:many researchers have extended the hits algorithms to improve its efficiency. chakrabarti et al$ [***, 6] used texts that surround hyperlinks in source web pages to help express the content of destination web pages. they also reduce weight factors of hyperlinks from the same domain to avoid a single website dominating the results of hits influence:2 type:2 pair index:121 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:115 citee title:mining the web's link structure citee abstract:the web is a hypertext body of approximately 300 million pages that continues to grow at roughly a million pages per day. page variation is more prodigious than the data's raw scale: taken as a whole, the set of web pages lacks a unifying structure and shows far more authoring style and content variation than that seen in traditional text-document collections. this level of complexity makes an "off-the-shelf" database-management and information-retrieval solution impossible. to date, index-based search engines for the web have been the primary tool by which users search for information. such engines can build giant indices that let you quickly retrieve the set of all web pages containing a given word or string. experienced users can make effective use of such engines for tasks that can be solved by searching for tightly constrained keywords and phrases. these search engines are, however, unsuited for a wide range of equally important tasks. in particular, a topic of any breadth will typically contain several thousand or million relevant web pages. how then, from this sea of pages, should a search engine select the correct ones-those of most value to the user? surrounding text:traditional keyword based text search engines cannot provide satisfying results to web queries since: (1) users tend to submit very short, sometime ambiguous queries and they are reluctant to provide feedback information [3]. (2) the quality of web pages varies greatly [***], and users usually prefer high quality pages over low quality pages in the result set returned by the search engine. (3) a non-trivial number of web queries target at finding a navigational starting point [9] or url of a known-item [8] on the web - many researchers have extended the hits algorithms to improve its efficiency. 
chakrabarti et al$ [5, ***] used texts that surround hyperlinks in source web pages to help express the content of destination web pages. they also reduce weight factors of hyperlinks from the same domain to avoid a single website dominating the results of hits influence:2 type:3 pair index:122 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:111 citee title:learning to probabilistically identify authoritative documents citee abstract:we describe a model of document citation that learns to identify hubs and authorities in a set of linked documents, such as pages retrieved from the world wide web, or papers retrieved from a research paper archive. unlike the popular hits algorithm, which relies on dubious statistical assumptions, our model provides probabilistic estimates that have clear semantics. we also find that in general, the identified authoritative documents correspond better to human intuition surrounding text:cohn et al. [***] introduced a probabilistic factor into hits and applied the em model. all these show that the authority idea has great potential in web applications influence:2 type:2 pair index:123 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. 
experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:116 citee title:overview of the trec-2002 web track citee abstract:the trec-2002 web track moved away from non-web relevance ranking and towards webspecific tasks on a 1.25 million page crawl .gov. the topic distillation task involved finding pages which were relevant, but also had characteristics which would make them desirable inclusions in a distilled list of key pages. the named page task is a variant of last years homepage finding task. the task is to find a particular page, but in this years task the page need not be a home page. surrounding text:(2) the quality of web pages varies greatly [6], and users usually prefer high quality pages over low quality pages in the result set returned by the search engine. (3) a non-trivial number of web queries target at finding a navigational starting point [9] or url of a known-item [***] on the web. thus, web pages containing textually similar content to the query may not be relevant at all influence:3 type:3 pair index:124 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:62 citee title:effective site finding using link anchor information citee abstract:link-based ranking methods have been described in the literature and applied in commercial web search engines. however, according to recent trec experiments, they are no better than traditional content-based methods. we conduct a different type of experiment, in which the task is to find the main entry point of a specific web site. in our experiments, ranking based on link anchor text is twice as effective as ranking based on document content, even though both methods used the same bm25 formula. we obtained these results using two sets of 100 queries on a 18.5 million document set and another set of 100 on a 0.4 million document set. this site finding effectiveness begins to explain why many search engines have adopted link methods. 
it also opens a rich new area for effectiveness improvement, where traditional methods fail surrounding text:(2) the quality of web pages varies greatly [6], and users usually prefer high quality pages over low quality pages in the result set returned by the search engine. (3) a non-trivial number of web queries target at finding a navigational starting point [***] or url of a known-item [8] on the web. thus, web pages containing textually similar content to the query may not be relevant at all influence:3 type:3 pair index:125 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:117 citee title:toward a unification of text and link analysis citee abstract:this paper presents a simple yet profound idea. by thinking about the relationships between and within terms and documents, we can generate a richer representation that encompasses aspects of web link analysis as well as text analysis techniques from information retrieval. this paper shows one path to this unified representation, and demonstrates the use of eigenvector calculations from web link analysis by stepping through a simple example surrounding text:the users importance is ignored in this algorithm. most recently, davison [***] analyzed multiple term document relationships by expanding the traditional document-term matrix into a matrix with term-term and doc-doc sub-matrices in the diagonal direction and term-doc and doc-term sub-matrices in the anti-diagonal direction. the term-term sub-matrix represents term relationships (e influence:2 type:2 pair index:126 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. 
in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:22 citee title:citation analysis as a tool in journal evaluation citee abstract:as a communications system, the network of journals that play a paramount role in the exchange of scientific and technical information is little understood. periodically since 1927, when gross and gross published their study (1) of references in 1 year's issues of the journal of the american chemical society, pieces of the network have been illuminated by the work of bradford (2), allen (3), gross and woodford (4), hooker (5), henkle (6), fussler (7), brown (8), and others (9). nevertheless, there is still no map of the journal network as a whole. to date, studies of the network and of the interrelation of its components have been limited in the number of journals, the areas of scientific study, and the periods of time their authors were able to consider; such shortcomings have not been due to any lack of purpose, insight, or energy on the part of investigators, but to the practical difficulty of compiling and manipulating manually the enormous amount of necessary data. surrounding text:researchers from the bibliometrics area claimed that scientific citations could be regarded as a special social network, where journals and papers are the nodes and the citation relationships are edges in the graph. garfield's famous impact factor [***] calculates the importance of a journal by counting the citations the journal received (the in-links) within a fixed amount of time. pinski and narin [20] claimed that the importance of a journal is recursively defined as the sum of the importance of all journals that cited it influence:3 type:2 pair index:127 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework.
experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:82 citee title:graph theory in practice citee abstract:what is the diameter of the world wide web? the answer is not 7,927 miles, even though the web truly is world wide. according to albert-lászló barabási, réka albert and hawoong jeong of notre dame university, the diameter of the web is 19. the diameter in question is not a geometric distance; the concept comes from the branch of mathematics called graph theory. on the web, you get from place to place by clicking on hypertext links, and so it makes sense to define distance by counting your steps through such links. the question is: if you select two web pages at random, how many links will separate them, on average? among the 800 million pages on the web, there's room to wander down some very long paths, but barabási et al. find that if you know where you're going, you can get just about anywhere in 19 clicks of the mouse. barabási's calculation reflects an interesting shift in the style and the technology of graph theory. just a few years ago it would have been unusual to apply graph-theoretical methods to such an enormous structure as the world wide web. of course just a few years ago the web didn't exist. now, very large netlike objects seem to be everywhere, and many of them invite graph-theoretical analysis. perhaps it is time to speak not only of graph theory but also of graph practice, or even graph engineering. surrounding text:finally, we conclude in section 5. figure 1: an example of multi-type interrelated data spaces. 2. related works: research on analyzing link structures to better understand the informational organization within data spaces can be traced back to research on social networks [***]. a good example comes from the telephone bill graph influence:1 type:1 pair index:128 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework.
experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:1 citee title:a new status index derived from sociometric analysis citee abstract:for the purpose of evaluating status in a manner free from the deficiencies of popularity contest procedures, this paper presents a new method of computation which takes into account who chooses as well as how many choose. it is necessary to introduce, in this connection, the concept of attenuation in influence transmitted through intermediaries surrounding text:the problem of link structure of social networks can be reduced to a graph g = (v, e), where set v refers to people, and set e refers to the relationship among people. katz [***] tried to measure the importance of a node in a graph by calculating the in-degree (both direct and indirect) of that node. hubbell [15] tried to do the same thing by propagating the importance weights on the graph so that the weight of each node achieves equilibrium influence:2 type:2 pair index:129 citer id:114 citer title:link fusion: a unified link analysis framework for multi-type interrelated data objects citer abstract:web link analysis has proven to be a significant enhancement for quality based web search. most existing links can be classified into two categories: intra-type links (e.g., web hyperlinks), which represent the relationship of data objects within a homogeneous data type (web pages), and inter-type links (e.g., user browsing log) which represent the relationship of data objects across different data types (users and web pages). unfortunately, most link analysis research only considers one type of link. in this paper, we propose a unified link analysis framework, called link fusion, which considers both the inter- and intra- type link structure among multiple-type inter-related data objects and brings order to objects in each data type at the same time. the pagerank and hits algorithms are shown to be special cases of our unified link analysis framework. experiments on an instantiation of the framework that makes use of the user data and web pages extracted from a proxy log show that our proposed algorithm could improve the search effectiveness over the hits and directhit algorithms by 24.6% and 38.2% respectively citee id:13 citee title:authoritative sources in a hyperlinked environment citee abstract:the network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. we develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the world wide web. the central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. we propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
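since the kleinberg abstract above defines authorities and hubs through a mutually reinforcing relationship, here is a minimal iterative sketch of that computation; the l2 normalization and the fixed iteration count are standard choices rather than details recorded in this dataset.

import numpy as np

def hits(adj, iters=50):
    """adj[i, j] = 1 if page i links to page j; returns (hub, authority) scores."""
    n = adj.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        auth = adj.T @ hub          # a page is a good authority if good hubs point to it
        hub = adj @ auth            # a page is a good hub if it points to good authorities
        auth /= np.linalg.norm(auth) or 1.0
        hub /= np.linalg.norm(hub) or 1.0
    return hub, auth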
surrounding text:based on the observations above, researchers tried different approaches to improve the effectiveness of web search engines. one of the representative solutions is re-ranking the top retrieved web pages by their importance [1, 11, ***], which is calculated by analyzing the hyperlinks among web pages. hyperlink analysis (such as [1, 3-7, 17-19]) has been shown to achieve much better performance than full text search, in production systems - most existing web related research fits into the web multi-space model we described in figure 1. for example, web search [4, ***] uses the web page space and hyperlinks within the space; collaborative filtering [14] uses the document (web page) space, the user space, and the browsing relationship in-between; web query clustering [22] uses the web page space, query space, and reference relationship in-between. unfortunately, most of these works only consider one type of link/relationship when analyzing the links/relationships of objects, and they can be classified into intra-type link analysis and inter-type link analysis regarding the type of links they use - 24.6% and 38.2% compared to the traditional hits [***] and directhit [11] algorithms, respectively. the rest of this paper is organized as follows - kleinberg [***] claimed that web pages and scientific documents are governed by different principles. journals have approximately the same purpose, and highly authoritative journals always refer to other authoritative journals - then, the root set is expanded to the base set by its neighborhoods, which are the web pages that either point to or are pointed at by pages in the root set. in this experiment, we set the maximum in-degree of nodes as 50, which is commonly adopted in previous works [3, ***]. the expanded set of web pages forms the data objects in hub space and authority space influence:2 type:2 pair index:130 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. the traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically, we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:121 citee title:wrapper generation for semi-structured internet sources citee abstract:with the current explosion of information on the world wide web (www) a wealth of information on many different subjects has become available on-line. numerous sources contain information that can be classified as semi-structured.
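the root-set/base-set construction described in the surrounding text above (expand the root set by the pages it points to and by at most 50 in-linking pages per node) can be sketched as follows; the dictionary-based graph representation is an assumed interface, not the cited system's data structure.

def expand_base_set(root_set, out_links, in_links, max_in_degree=50):
    """root_set: iterable of page ids; out_links/in_links: dicts page -> list of pages.
    Returns the base set used as the node set for the hub and authority spaces."""
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))                 # pages the root page points to
        base.update(in_links.get(page, [])[:max_in_degree])  # cap in-linking pages at 50
    return base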
at present, however, the only way to access the information is by browsing individual pages. we cannot query web documents in a database-like fashion based on their underlying structure. however, we can provide database-like querying for semi-structured www sources by building wrappers around these sources. we present an approach for semi-automatically generating such wrappers. the key idea is to exploit the formatting information in pages from the source to hypothesize the underlying structure of a page. from this structure the system generates a wrapper that facilitates querying of a source and possibly integrating it with other sources. we demonstrate the ease with which we are able to build wrappers for a number of internet sources in different domains using our implemented wrapper generation toolkit surrounding text:we can imagine that if these objects can be extracted and integrated from the web, powerful object-level search engines can be built to meet users' information needs more precisely, especially for some specific domains. such a perspective has led to significant interest in research communities, and related technologies such as wrapper deduction [11, 2], web database schema matching [16, 8], and object identification on the web [15] have been developed in recent years. these techniques made it possible for us to extract and integrate all the related web information about the same object together as an information unit - 8. references [***] n. ashish and c influence:3 type:3 pair index:131 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. the traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically, we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:14 citee title:authority-based keyword queries in databases using objectrank citee abstract:the objectrank system applies authority-based ranking to keyword search in databases modeled as labeled graphs. conceptually, authority originates at the nodes (objects) containing the keywords and flows to objects according to their semantic connections. each node is ranked according to its authority with respect to the particular keywords. one can adjust the weight of global importance, the weight of each keyword of the query, the importance of a result actually containing the keywords versus surrounding text:balmin et al. propose the objectrank system [***] which applies the random walk model to keyword search in databases modelled as labelled graphs.
a similar notion of our popularity propagation factors, called authority transfer rates, is introduced influence:1 type:2 pair index:132 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. the traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically, we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:108 citee title:the anatomy of a large-scale hypertextual web search engine citee abstract:in this paper, we present google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. google is designed to crawl and index the web efficiently and produce much more satisfying search results than existing systems. the prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/ to engineer a search engine is a challenging task. search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. they answer tens of millions of queries every day. despite the importance of large-scale search engines on the web, very little academic research has been done on them. furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. this paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. this paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want surrounding text:since the information about an object is normally represented as a block of a web page [12, 14], the web popularity can be computed by considering the pagerank [***, 13] scores of web pages containing the object and the importance of the web page blocks [14, 5]. we assume web databases will uniformly propagate their popularity (i - in this way, we can also handle structured queries very well, and at the same time, give different attribute-weights to the hits [***] in
different attributes in calculating the relevance score influence:1 type:1,2 pair index:133 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. the traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically, we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:21 citee title:block-level link analysis citee abstract:link analysis has shown great potential in improving the performance of web search. pagerank and hits are two of the most popular algorithms. most of the existing link analysis algorithms treat a web page as a single node in the web graph. however, in most cases, a web page contains multiple semantics and hence the web page might not be considered as the atomic node. in this paper, the web page is partitioned into blocks using the vision-based page segmentation algorithm. by extracting the page-to-block, block-to-page relationships from link structure and page layout analysis, we can construct a semantic graph over the www such that each node exactly represents a single semantic topic. this graph can better describe the semantic structure of the web. based on block-level link analysis, we proposed two new algorithms, block level pagerank and block level hits, whose performances we study extensively using web data surrounding text:since the information about an object is normally represented as a block of a web page [12, 14], the web popularity can be computed by considering the pagerank [4, 13] scores of web pages containing the object and the importance of the web page blocks [14, ***]. we assume web databases will uniformly propagate their popularity (i influence:2 type:2 pair index:134 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. the traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain.
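reading the surrounding texts above together with the poprank abstract, one plausible sketch combines the two ingredients they mention: an object's web popularity aggregates the pagerank of the pages that contain it weighted by the importance of the containing blocks, and that score is then propagated over the heterogeneous object graph with one propagation factor per relationship type. the damping split, the normalization by total outgoing factor and the data structures are assumptions, not the paper's actual equations.

from collections import defaultdict

def web_popularity(occurrences, pagerank, block_importance):
    """occurrences: object -> list of (page, block); returns a base popularity per object."""
    pop = defaultdict(float)
    for obj, places in occurrences.items():
        for page, block in places:
            pop[obj] += pagerank[page] * block_importance[block]
    return pop

def propagate_popularity(objects, edges, factors, base_pop, damping=0.85, iters=50):
    """edges: list of (src, dst, rel_type); factors: rel_type -> propagation factor."""
    rank = {o: 1.0 / len(objects) for o in objects}
    out_weight = defaultdict(float)
    for s, _, r in edges:
        out_weight[s] += factors[r]
    for _ in range(iters):
        new = {o: (1 - damping) * base_pop.get(o, 0.0) for o in objects}
        for s, d, r in edges:
            if out_weight[s] > 0:
                new[d] += damping * rank[s] * factors[r] / out_weight[s]
        rank = new
    return rank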
specifically, we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:122 citee title:xrank: ranked keyword search over xml documents citee abstract:we consider the problem of efficiently producing ranked results for keyword search queries over hyperlinked xml documents. evaluating keyword search queries over hierarchical xml documents, as opposed to (conceptually) flat html documents, introduces many new challenges. first, xml keyword search queries do not always return entire documents, but can return deeply nested xml elements that contain the desired keywords. second, the nested structure of xml implies that the notion of ranking is no surrounding text:guo et al. [***] introduce xrank to rank xml elements using the link structure of the database. balmin et al influence:1 type:2 pair index:135 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. the traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically, we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:49 citee title:discovering complex matchings across web query interfaces: a correlation mining approach citee abstract:to enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. while complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. to tackle this challenge, this paper takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching web query interfaces to integrate the myriad databases on the internet. on this "deep web," query interfaces generally form complex matchings between attribute groups (e.g., corresponds to in the books domain). we observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., ) tend to be co-present in query interfaces and thus positively correlated. in contrast, synonym attributes are negatively correlated because they rarely co-occur.
this insight enables us to discover complex matchings by a correlation mining approach. in particular, we develop the dcm framework, which consists of data preparation, dual mining of positive and negative correlations, and finally matching selection. unlike previous correlation mining algorithms, which mainly focus on finding strong positive correlations, our algorithm considers both positive and negative correlations, especially the subtlety of negative correlations, due to its special importance in schema matching. this leads to the introduction of a new correlation measure, h-measure, distinct from those proposed in previous work. we evaluate our approach extensively and the results show good accuracy for discovering complex matchings surrounding text:we can imagine that if these objects can be extracted and integrated from the web, powerful object-level search engines can be built to meet users' information needs more precisely, especially for some specific domains. such a perspective has led to significant interest in research communities, and related technologies such as wrapper induction [11, 2], web database schema matching [16, ***], and object identification on the web [15] have been developed in recent years. these techniques made it possible for us to extract and integrate all the related web information about the same object together as an information unit influence:3 type:3 pair index:136 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:123 citee title:optimization by simulated annealing citee abstract:there is a deep and useful connection between statistical mechanics (the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature) and multivariate or combinatorial optimization (finding the minimum of a given function depending on many parameters). a detailed analogy with annealing in solids provides a framework for optimization of the properties of very large and complex systems. this connection to statistical mechanics exposes new information and provides an unfamiliar perspective on traditional optimization problems and methods. surrounding text:there are a number of heuristic-based search algorithms that could be adapted to solve the problem.
in figure 3 we show the safa (simulated annealing for factor assignment) algorithm which adapts the simulated annealing algorithm [***] to automatically assign popularity propagation factors. algorithm safa(timeout: stopping condition) for (each object type x) n influence:2 type:1 pair index:137 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:13 citee title:authoritative sources in a hyperlinked environment citee abstract:the network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. we develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the world wide web. the central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. we propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis. surrounding text:different popularity according to their in-links. technologies such as pagerank [13] and hits [***] have been successfully applied to distinguish the popularity of different web pages through analyzing the link structure in the web graph - [17] propose a unified link analysis framework called "link fusion" to consider both the inter- and intra-type link structure among multi-type inter-related data objects. the pagerank and hits algorithms [***] are shown to be special cases of the unified link analysis framework.
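the safa fragment quoted above is truncated in this record; as a hedged sketch of how a simulated-annealing search over propagation factors could be organised (the neighbour move, the cooling schedule and the ranking-distance objective passed in as loss are assumptions, not the paper's exact procedure):

    import math, random

    def anneal_factors(types, loss, t0=1.0, cooling=0.95, steps=200):
        # types: relationship types; loss(factors) should compare the ranking induced
        # by these factors against some reference ranking (assumed to be available).
        factors = {t: random.random() for t in types}
        best, best_cost = dict(factors), loss(factors)
        cost, temp = best_cost, t0
        for _ in range(steps):
            cand = dict(factors)
            t = random.choice(list(types))
            cand[t] = min(1.0, max(0.0, cand[t] + random.uniform(-0.1, 0.1)))
            c = loss(cand)
            # accept improvements, and worse moves with a temperature-dependent probability
            if c < cost or random.random() < math.exp((cost - c) / temp):
                factors, cost = cand, c
                if c < best_cost:
                    best, best_cost = dict(cand), c
            temp *= cooling
        return best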
although the paper mentions a notion similar to our popularity propagation factor, how to assign these factors is considered the most important future research work in that paper influence:2 type:1,2 pair index:138 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain. specifically we assign a popularity propagation factor to each type of object relationship, study how different popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose efficient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve significantly better ranking results than naively applying pagerank on the object graph citee id:124 citee title:wrapper induction for information extraction citee abstract:the internet presents numerous sources of useful information: telephone directories, product catalogs, stock quotes, weather forecasts, etc. recently, many systems have been built that automatically gather and manipulate such information on a user's behalf. however, these resources are usually formatted for use by people (e.g., the relevant content is embedded in html pages), so extracting their content is difficult. wrappers are often used for this purpose. a wrapper is a procedure for. surrounding text:we can imagine that if these objects can be extracted and integrated from the web, powerful object-level search engines can be built to meet users' information needs more precisely, especially for some specific domains. such a perspective has led to significant interest in research communities, and related technologies such as wrapper induction [***, 2], web database schema matching [16, 8], and object identification on the web [15] have been developed in recent years. these techniques made it possible for us to extract and integrate all the related web information about the same object together as an information unit influence:3 type:3 pair index:139 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specific application domain and rank these objects in terms of their relevance and popularity to answer user queries. traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specific domain.
specically we assign a popularity propagation factor to each type of object relationship, study how di?erent popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose ecient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve signicantly better ranking results than naively applying pagerank on the object graph citee id:118 citee title:mining data records in web pages citee abstract:a large amount of information on the web is contained in regularly structured objects, which we call data records. such data records are important because they often present the essential information of their host pages, e.g., lists of products and services. it is useful to mine such data records in order to extract information from them to provide value-added services. existing approaches to solving this problem mainly include the manual approach, supervised learning, and automatic surrounding text:keeping clicking on the links between web pages) or through some web page search engines (or some combination)2. since the information about an object is normally represented as a block of a web page [***, 14]. the web popularity can be computed by considering the pagerank [4, 13] scores of web pages containing the object and the importance of the web page blocks [14, 5] influence:2 type:3 pair index:140 citer id:120 citer title:object-level ranking: bringing order to web objects citer abstract:in contrast with the current web search methods that essentially do document-level ranking and retrieval, we are exploring a new paradigm to enable web search at the object level. we collect web information for objects relevant for a specic application domain and rank these objects in terms of their relevance and popularity to answer user queries. traditional pagerank model is no longer valid for object popularity calculation because of the existence of heterogeneous relationships between objects. this paper introduces poprank, a domain-independent object-level link analysis model to rank the objects within a specic domain. specically we assign a popularity propagation factor to each type of object relationship, study how di?erent popularity propagation factors for these heterogeneous relationships could affect the popularity ranking, and propose ecient approaches to automatically decide these factors. our experiments are done using 1 million cs papers, and the experimental results show that poprank can achieve signicantly better ranking results than naively applying pagerank on the object graph citee id:106 citee title:learning block importance models for web pages citee abstract:some previous works show that a web page can be partitioned to multiple segments or blocks, and usually the importance of those blocks in a page is not equivalent. also, it is proved that differentiating noisy or unimportant blocks from pages can facilitate web mining, search and accessibility. but in these works, no uniform approach or model is presented to measure the importance of different portions in web pages. through a user study, we found that people do have a consistent view about the importance of blocks in web pages. in this paper, we investigate how to find a model to automatically assign importance values to blocks in a web page. we define the block importance estimation as a learning problem. 
first, we use the vips (vision-based page segmentation) algorithm to partition a web page into semantic blocks with a hierarchy structure. then spatial features (such as position, size) and content features (such as the number of images and links) are extracted to construct a feature vector for each block. based on these features, learning algorithms, such as svm and neural network, are applied to train various block importance models. in our experiments, the best model can achieve the performance with micro-f1 79% and micro-accuracy 85.9%, which is quite close to a persons surrounding text:keeping clicking on the links between web pages) or through some web page search engines (or some combination)2. since the information about an object is normally represented as a block of a web page [12, ***]. the web popularity can be computed by considering the pagerank [4, 13] scores of web pages containing the object and the importance of the web page blocks [***, 5] - since the information about an object is normally represented as a block of a web page [12, ***]. the web popularity can be computed by considering the pagerank [4, 13] scores of web pages containing the object and the importance of the web page blocks [***, 5]. we assume web databases will uniformly propagate its popularity (i influence:2 type:3 pair index:141 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:16 citee title:automatic personalization based on web usage mining citee abstract:the ease and speed with which business transactions can be carried out over the web have been a key driving force in the rapid growth of e-commerce. the ability to track user browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. it is now possible for vendors to personalize their product messages for individual customers on a massive scale, a phenomenon referred to as mass customization. of course, this type of surrounding text:it can actively recommend items to users according to their interests and behaviors. personalization systems for the web can be classified into three types: rule-based systems, content-based filtering systems and collaborative filtering systems [***]. rule-based systems, such as broadvision (www - 8. references [***] mobasher, b. , colley, r influence:3 type:3 pair index:142 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. 
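the block-importance abstract above frames the task as supervised learning over spatial and content features of each page block; a small illustrative setup (the field names and the scikit-learn svm are assumptions, not the cited system) might be:

    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    def block_features(block):
        # block: dict with spatial features (position, size) and content features
        # (number of images and links), as described in the abstract; names assumed.
        return [block["x"], block["y"], block["width"], block["height"],
                block["num_images"], block["num_links"]]

    def train_block_importance(blocks, labels):
        X = [block_features(b) for b in blocks]
        model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
        model.fit(X, labels)
        return model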
its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:113 citee title:letizia: an agent that assists web browsing citee abstract:letizia is a user interface agent that assists a user browsing the world wide web. as the user operates a conventional web browser such as netscape, the agent tracks user behavior and attempts to anticipate items of interest by doing concurrent, autonomous exploration of links from the user's current position. the agent automates a browsing strategy consisting of a best-first search augmented by heuristics inferring user interest from browsing behavior. 1 introduction "letizia lvarez de surrounding text:com), allow website administrators to specify rules used to determine the contents served to individual users. content-based filtering systems, such as letizia [***], personal webwatcher [3], syskill & webert [4], associate every user with profiles to filter contents. collaborative filtering systems, such as grouplens [5], webwatcher [6] and lets browse [7], utilize the similarity among profiles of users to recommend interesting materials influence:3 type:3 pair index:143 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:127 citee title:personal webwatcher: design and implementation citee abstract:with the growing availability of information sources, especially non-homogeneous, distributed sources like the world wide web, there is also a growing interest in tools that can help in making a good and quick selection of information we are interested in surrounding text:com), allow website administrators to specify rules used to determine the contents served to individual users. content-based filtering systems, such as letizia [2], personal webwatcher [***], syskill & webert [4], associate every user with profiles to filter contents. 
collaborative filtering systems, such as grouplens [5], webwatcher [6] and lets browse [7], utilize the similarity among profiles of users to recommend interesting materials influence:3 type:3 pair index:144 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:137 citee title:syskill & webert: identifying interesting web sites citee abstract:we describe syskill & webert, a software agent that learns to rate pages on the world wide web (www), deciding what pages might interest a user. the user rates explored pages on a three point scale, and syskill & webert learns a user profile by analyzing the information on each page. the user profile can be used in two ways. first, it can be used to suggest which links a user would be interested in exploring. second, it can be used to construct a lycos query to find pages that would interest a user. we compare six different algorithms from machine learning and information retrieval on this task. we find that the naive bayesian classifier offers several advantages over other learning algorithms on this task. furthermore, we find that an initial portion of a web page is sufficient for making predictions on its interestingness substantially reducing the amount of network transmission required to make predictions surrounding text:com), allow website administrators to specify rules used to determine the contents served to individual users. content-based filtering systems, such as letizia [2], personal webwatcher [3], syskill & webert [***], associate every user with profiles to filter contents. collaborative filtering systems, such as grouplens [5], webwatcher [6] and lets browse [7], utilize the similarity among profiles of users to recommend interesting materials influence:3 type:3 pair index:145 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. 
in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:28 citee title:grouplens: applying collaborative filtering to usenet news citee abstract:the grouplens project designed, implemented, and evaluated a collaborative filtering system for usenet news, a high-volume, high-turnover discussion list service on the internet. usenet newsgroups, the individual discussion lists, may carry hundreds of messages each day. while in theory the newsgroup organization allows readers to select the content that most interests them, in practice surrounding text:content-based filtering systems, such as letizia [2], personal webwatcher [3], syskill & webert [4], associate every user with profiles to filter contents. collaborative filtering systems, such as grouplens [***], webwatcher [6] and lets browse [7], utilize the similarity among profiles of users to recommend interesting materials. collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce influence:1 type:2 pair index:146 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:125 citee title:webwatcher: a tour guide for the world wide web citee abstract:we explore the notion of a tour guide software agent for assisting users browsing the world wide web. a web tour guide agent provides assistance similar to that provided by a human tour guide in a museum-- it guides the user along an appropriate path through the collection, based on its knowledge of the user's interests, of the location and relevance of various items in the collection, and of the way in which others have interacted with the collection in the past. this paper describes a simple but operational tour guide, called webwatcher, which has given over 5000 tours to people browsing cmu's school of computer science web pages. webwatcher accompanies users from page to page, suggests appropriate hyperlinks, and learns from experience to improve its advice-giving skills. we describe the learning algorithms used by webwatcher, experimental results showing their effectiveness, and lessons learned from this case study in web tour guide agents surrounding text:content-based filtering systems, such as letizia [2], personal webwatcher [3], syskill & webert [4], associate every user with profiles to filter contents. collaborative filtering systems, such as grouplens [5], webwatcher [***] and lets browse [7], utilize the similarity among profiles of users to recommend interesting materials.
collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce influence:3 type:3 pair index:147 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:77 citee title:fab: content-based, collaborative recommendation citee abstract:online readers are in need of tools to help them cope with the mass of content available on the world-wide web. in traditional media, readers are provided assistance in making selections. this includes both implicit assistance in the form of editorial oversight and explicit assistance in the form of recommendation services such as movie reviews and restaurant guides. the electronic medium offers new opportunities to create recommendation services, ones that adapt over time to track their evolving interests. fab is such a recommendation system for the web, and has been operational in several versions since december 1994. surrounding text:in scalability aspect, the nearest neighbor algorithm suffers serious scalability problems in failing to scale up its computation with the growth of both the number of users and the number of items. to solve the first problem, balabanovic et al [***] and claypool et al [9] put forward a content-based collaborative filtering method, which utilizes the contents browsed by users to compute the similarity among users. sarwar et al [10] uses latent semantic indexing (lsi) to capture the similarity among users and items in a reduced dimensional space influence:1 type:2 pair index:148 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:44 citee title:combining content-based and collaborative filters in an online newspaper citee abstract:the explosive growth of mailing lists, web sites and usenet news demands effective filtering solutions. collaborative filtering combines the informed opinions of humans to make personalized, accurate predictions. 
content-based filtering uses the speed of computers to make complete, fast predictions. in this work, we present a new filtering approach that combines the coverage and speed of content-filters with the depth of collaborative filtering. we apply our research approach to an online newspaper, an as yet untapped opportunity for filters useful to the wide-spread news reading populace. we present the design of our filtering system and describe the results from preliminary experiments that suggest merits to our approach. surrounding text:in the scalability aspect, the nearest neighbor algorithm suffers serious scalability problems, failing to scale up its computation with the growth of both the number of users and the number of items. to solve the first problem, balabanovic et al [8] and claypool et al [***] put forward a content-based collaborative filtering method, which utilizes the contents browsed by users to compute the similarity among users. sarwar et al [10] uses latent semantic indexing (lsi) to capture the similarity among users and items in a reduced dimensional space influence:1 type:2 pair index:149 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:79 citee title:feature weighting and instance selection for collaborative filtering citee abstract:collaborative filtering uses a database about consumers' preferences to make personal product recommendations and is achieving widespread success in e-commerce nowadays. in this paper, we present several feature-weighting methods to improve the accuracy of collaborative filtering algorithms. furthermore, we propose to reduce the training data set by selecting only highly relevant instances. we evaluate various methods on the well-known eachmovie data set. our experimental results show that mutual information achieves the largest accuracy gain among all feature-weighting methods. the most interesting fact is that our data reduction method even achieves an improvement of the accuracy of about 6% while speeding up the collaborative filtering algorithm by a factor of 15 surrounding text:sarwar et al [10] uses latent semantic indexing (lsi) to capture the similarity among users and items in a reduced dimensional space. yu et al [***] uses a feature-weighting method to improve the accuracy of collaborative filtering algorithms. to solve the second problem, many studies bring forward model-based methods that use users' preferences to learn a model, which is then used for predictions - similarity measure. there exist relationships between items. yu et al [***] computes dependencies between items using the mutual information method and presents a feature weighting method to compute the similarity between users.
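the feature-weighting work cited above estimates dependencies between items with mutual information; one plausible way to compute it from co-rated votes (the discretisation of votes and the restriction to co-rated users are assumptions) is:

    import numpy as np
    from collections import Counter

    def mutual_information(votes_a, votes_b):
        # votes_a, votes_b: same-length sequences of discrete votes for two items,
        # with None for missing entries; only co-rated users are used (assumption).
        pairs = [(a, b) for a, b in zip(votes_a, votes_b)
                 if a is not None and b is not None]
        n = len(pairs)
        if n == 0:
            return 0.0
        pa = Counter(a for a, _ in pairs)
        pb = Counter(b for _, b in pairs)
        pab = Counter(pairs)
        mi = 0.0
        for (a, b), c in pab.items():
            p_ab = c / n
            mi += p_ab * np.log2(p_ab / ((pa[a] / n) * (pb[b] / n)))
        return mi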
and sarwar et al [13] computes similarities between items and uses them directly for prediction - 3.3 sensitivity of instance selection. those irrelevant instances have a significant impact on the prediction quality [***]. we evaluated the quality of our method of selecting relevant instances - 4 comparison experiments [figure: prediction time (sec.) versus size of training set (# of users) for the user-based, item-based, cluster-based, feature-weighting and class-based algorithms] we implemented five different collaborative filtering algorithms: user-based algorithm, item-based algorithm [13], cluster-based algorithm [12], feature-weighting algorithm [***], and our algorithm, which is a class-based algorithm. the results of the comparison of accuracy between these five algorithms are shown in figure 7 influence:1 type:2 pair index:150 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:35 citee title:empirical analysis of predictive algorithms for collaborative filtering citee abstract:collaborative filtering or recommender systems use a database about user preferences to predict additional topics or products a new user might like. in this paper we describe several algorithms designed for this task, including techniques based on correlation coefficients, vector-based similarity calculations, and statistical bayesian methods. we compare the predictive accuracy of the various methods in a set of representative problem domains. we use two basic classes of evaluation metrics. the first characterizes accuracy over a set of individual predictions in terms of average absolute deviation. the second estimates the utility of a ranked list of suggested items. this metric uses an estimate of the probability that a user will see a recommendation in an ordered list. experiments were run for datasets associated with 3 application areas, 4 experimental protocols, and the 2 evaluation metrics for the various algorithms. results indicate that for a wide range of conditions, bayesian networks with decision trees at each node and correlation methods outperform bayesian-clustering and vector-similarity methods. between correlation and bayesian networks, the preferred method depends on the nature of the dataset, nature of the application (ranked versus one-by-one presentation), and the availability of votes with which to make predictions. other considerations include the size of database, speed of predictions, and learning time. surrounding text:to solve the second problem, many studies bring forward model-based methods that use users' preferences to learn a model, which is then used for predictions. breese et al [***] utilizes clustering and bayesian network approaches.
its results show that the clustering-based method is the most efficient but suffers from poor accuracy - collaborative filtering. the task of collaborative filtering is to predict the preference of an active user to a target item based on user preferences. there are two general classes of collaborative filtering algorithms: memory-based methods and model-based methods [***]. 2 - the model can be built off-line over several hours or days. the resulting model is very small, very fast, and essentially as accurate as memory-based methods [***]. model-based methods may prove practical for environments in which user preferences change slowly with respect to the time needed to build the model - 4 compute the mean absolute error of prediction and output. as in [***], we also employ two protocols, all but 1, and given k. in the first class, we randomly withhold a single randomly selected vote for each test user, and try to predict its value given all the other votes the user has voted on - 4 comparison experiments [figure: prediction time (sec.) versus size of training set (# of users) for the user-based, item-based, cluster-based, feature-weighting and class-based algorithms] we implemented five different collaborative filtering algorithms: user-based algorithm, item-based algorithm [13], cluster-based algorithm [***], feature-weighting algorithm [11], and our algorithm, which is a class-based algorithm. the results of the comparison of accuracy between these five algorithms are shown in figure 7 influence:1 type:2 pair index:151 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:95 citee title:instance selection techniques for memory-based collaborative filtering citee abstract:collaborative filtering (cf) has become an important data mining technique to make personalized recommendations for books, web pages or movies, etc. one popular algorithm is the memory-based collaborative filtering, which predicts a user's preference based on his or her similarity to other users (instances) in the database. however, with the tremendous growth of users and the large number of products, memory-based cf algorithms face the problem of deciding the right instances to use during prediction, in order to reduce execution cost and excessive storage, and possibly to improve the generalization accuracy by avoiding noise and overfitting. in this paper, we focus our work on a typical user preference database that contains many missing values, and propose four novel instance reduction techniques called turf1-turf4 as a preprocessing step to improve the efficiency and accuracy of the memory-based cf algorithm. the key idea is to generate predictions from a carefully selected set of relevant instances.
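for the "all but 1" protocol and the mean absolute error step mentioned above, a hedged sketch of the evaluation loop (predict stands for any of the compared algorithms and is an assumed callable, not part of the cited work):

    import random

    def mae_all_but_one(test_users, predict):
        # test_users: {user_id: {item_id: vote}}; predict(user_votes, item) returns
        # a predicted vote for `item` given the user's remaining votes.
        errors = []
        for user, votes in test_users.items():
            if len(votes) < 2:
                continue
            held_out = random.choice(list(votes))            # withhold one vote
            observed = {i: v for i, v in votes.items() if i != held_out}
            errors.append(abs(predict(observed, held_out) - votes[held_out]))
        return sum(errors) / len(errors) if errors else 0.0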
we evaluate the techniques on the well-known eachmovie data set. our experiments showed that the proposed algorithms not only dramatically speed up the prediction but also improve the accuracy. surrounding text:sarwar et al [14] presents a rule-based approach using association rule discovery algorithms to find associations between relevant items and then generates item recommendations based on the strength of the association between items. yu et al [***] solves the second problem from another point of view. it adopts a technique of instance selection to remove the irrelevant and redundant instances from the training set influence:1 type:2 pair index:152 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:33 citee title:grouplens: an open architecture for collaborative filtering of netnews citee abstract:collaborative filters help people make choices based on the opinions of other people. grouplens is a system for collaborative filtering of netnews, to help people find articles they will like in the huge stream of available articles. news reader clients display predicted scores and make it easy for users to rate articles after they read them. rating servers, called better bit bureaus, gather and disseminate the ratings. the rating servers predict scores based on the heuristic that people who surrounding text:2.1 memory-based algorithm. the memory-based algorithm [***] is the most popular prediction technique in collaborative filtering applications. its basic idea is to compute the active user's predicted vote of an item as a weighted average of the votes given to that item by other users influence:1 type:2 pair index:153 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods.
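the memory-based idea quoted above, the active user's predicted vote as a weighted average of other users' votes on the same item, can be sketched as follows; mean-centering and the pearson weight follow common practice and are assumptions rather than this paper's exact formulation:

    import numpy as np

    def pearson(u, v, common):
        a = np.array([u[i] for i in common])
        b = np.array([v[i] for i in common])
        if a.std() == 0 or b.std() == 0:
            return 0.0
        return float(np.corrcoef(a, b)[0, 1])

    def predict_vote(active, others, item):
        # active: {item: vote} for the active user; others: list of such dicts.
        mean_a = np.mean(list(active.values()))
        num = den = 0.0
        for other in others:
            if item not in other:
                continue
            common = set(active) & set(other)   # co-rated items
            if len(common) < 2:
                continue
            w = pearson(active, other, sorted(common))
            num += w * (other[item] - np.mean(list(other.values())))
            den += abs(w)
        return mean_a if den == 0 else mean_a + num / den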
in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:2 citee title:a probabilistic analysis of the rocchio algorithm with tfidf for text categorization citee abstract:the rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. here, a probabilistic analysis of this algorithm is presented in a text categorization framework. the analysis gives theoretical insight into the heuristics used in the rocchio algorithm, particularly the word weighting scheme and the similarity metric. it also suggests improvements which lead to a probabilistic variant of the rocchio classifier. the rocchio classifier, its probabilistic variant, and a naive bayes classifier are compared on six text categorization tasks. the results show that the probabilistic algorithms are preferable to the heuristic rocchio classifier not only because they are more well-founded, but also because they achieve better performance. surrounding text:so we first classify all items. for instance, for those text-formatting items, we can use the naive bayes method to carry out classification [***]. and then we convert the user-item matrix into a user-class matrix influence:3 type:3,1 pair index:154 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:105 citee title:knowledge discovery in large spatial databases: focusing techniques for efficient class identification citee abstract:both, the number and the size of spatial databases are rapidly growing because of the large amount of data obtained from satellite images, x-ray crystallography or other scientific equipment. therefore, automated knowledge discovery becomes more and more important in spatial databases. so far, most of the methods for knowledge discovery in databases (kdd) have been based on relational database systems. in this paper, we address the task of class identification in spatial databases using clustering techniques. we put special emphasis on the integration of the discovery methods with the db interface, which is crucial for the efficiency of kdd on large databases. the key to this integration is the use of a well-known spatial access method, the r*-tree. the focusing component of a kdd system determines which parts of the database are relevant for the knowledge discovery task. we present several strategies for focusing: selecting representatives from a spatial database, focusing on the relevant clusters and retrieving all objects of a given cluster. we have applied the proposed techniques to real data from a large protein database used for predicting protein-protein docking. 
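the matrix conversion described above, classify items and then aggregate the sparse user-item votes into a user-class matrix, could be sketched like this (averaging the votes within a class is an assumed aggregation choice, and every item is assumed to have a class label):

    import numpy as np

    def to_user_class_matrix(user_item, item_class, classes):
        # user_item: {user: {item: vote}}; item_class: {item: class label}.
        users = sorted(user_item)
        col = {c: j for j, c in enumerate(classes)}
        M = np.full((len(users), len(classes)), np.nan)
        for i, u in enumerate(users):
            per_class = {}
            for item, vote in user_item[u].items():
                per_class.setdefault(item_class[item], []).append(vote)
            for c, vs in per_class.items():
                M[i, col[c]] = float(np.mean(vs))   # average vote per class
        return users, M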
a performance evaluation on this database indicates that clustering on large spatial databases can be performed both efficiently and effectively. this research was funded by the german minister for research and technology (bmft) under grant no. 01 ib 307 b. the authors are responsible for the contents of this paper surrounding text:to respond to these challenges, we present an instance selection method. different from other random sampling or data focusing techniques [***], the method can not only reduce the size of the training set but also improve the accuracy of prediction. 4 influence:3 type:2 pair index:155 citer id:135 citer title:similarity measure and instance selection for collaborative filtering citer abstract:collaborative filtering has been very successful in both research and applications such as information filtering and e-commerce. the k-nearest neighbor (knn) method is a popular way for its realization. its key technique is to find k nearest neighbors for a given user to predict his interests. however, this method suffers from two fundamental problems: sparsity and scalability. in this paper, we present our solutions for these two problems. we adopt two techniques: a matrix conversion method for similarity measure and an instance selection method. and then we present an improved collaborative filtering algorithm based on these two methods. in contrast with existing collaborative algorithms, our method shows its satisfactory accuracy and performance citee id:36 citee title:latent class models for collaborative filtering citee abstract:this paper presents a statistical approach to collaborative filtering and investigates the use of latent class models for predicting individual choices and preferences based on observed preference behavior. two models are discussed and compared: the aspect model, a probabilistic latent space model which models individual preferences as a convex combination of preference factors, and the two-sided clustering model, which simultaneously partitions persons and objects into clusters. we present em algorithms for different variants of the aspect model and derive an approximate em algorithm based on a variational principle for the two-sided clustering model. the benefits of the different models are experimentally investigated on a large movie data set. surrounding text:1 relevancy between users and items. hofmann [20] proposes an aspect model, a latent class statistical mixture model, for associating word-document co-occurrence data with a set of latent variables. hofmann et al [***] applies the aspect model to user-item co-occurrence data for collaborative filtering. in the aspect model, users u ∈ U = {u1, ..., un}, together with the items they rate i ∈ I = {i1, ..., im}, form observations (u, i), which are associated with one of the latent class variables c ∈ C = {c1, ..., ck} - for those text-formatting items, we can use the naive bayes method to carry out computation. hofmann et al [***] presents an expectation maximization (em) algorithm for maximum likelihood estimation of these parameters. however, the em algorithm is not suitable here because the knowledge of classification of items obtained from the user-item matrix is different from the content-based classification of items influence:1 type:2
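for reference, the standard aspect-model factorisation and em e-step that the surrounding text alludes to, written as a hedged latex summary (the exact variant used in the cited work may differ):

    P(u, i) = \sum_{c \in C} P(c)\, P(u \mid c)\, P(i \mid c)

    P(c \mid u, i) = \frac{P(c)\, P(u \mid c)\, P(i \mid c)}
                          {\sum_{c' \in C} P(c')\, P(u \mid c')\, P(i \mid c')}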