SCALABLE ALGORITHMS FOR CQA POST VOTING PREDICTION

 

ABSTRACT

Community Question Answering (CQA) sites, such as Stack Overflow and Yahoo! Answers, have become very popular in recent years. These sites contain rich crowdsourced knowledge contributed by their users in the form of questions and answers, and these questions and answers can in turn satisfy the information needs of many more users. In this article, we aim at predicting the voting scores of questions/answers shortly after they are posted on CQA sites. To accomplish this task, we identify three key aspects that affect the voting score of a post: the non-linear relationships between features and output, the coupling between questions and answers, and the dynamic fashion in which data arrive. We propose a family of algorithms to model these three key aspects, together with approximations and extensions that scale up the computation. We analyze the proposed algorithms in terms of optimality, correctness, and complexity. Extensive experimental evaluations on two real data sets demonstrate the effectiveness and efficiency of our algorithms.
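To make the "non-linearity" aspect concrete, the following is a minimal sketch of voting-score prediction with kernel ridge regression in dual variables (in the spirit of reference [4]). The RBF kernel, the toy post features, and the regularization value are illustrative assumptions, not the paper's actual setup.

```python
# Hedged sketch: kernel ridge regression in dual variables for predicting
# a post's voting score from its early features. Kernel choice, features,
# and hyperparameters are illustrative assumptions.
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Pairwise RBF kernel between rows of A and rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def fit_dual_ridge(X, y, lam=1.0, gamma=0.1):
    """Solve (K + lam*I) alpha = y; returns the dual coefficients."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_train, alpha, X_new, gamma=0.1):
    """Predicted voting scores for new posts: k(x)^T alpha."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# toy example: each row holds early features of a post (e.g., length, #links)
X = np.random.rand(100, 5)
y = np.random.rand(100)          # observed voting scores
alpha = fit_dual_ridge(X, y)
print(predict(X, alpha, np.random.rand(3, 5)))
```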

PROPOSED SYSTEM:

We propose a family of algorithms for predicting the voting scores of question/answer posts. Our algorithms enjoy three key advantages. First, they are comprehensive, in the sense that our model naturally captures all three key aspects (i.e., non-linearity, coupling, and dynamics) that affect the voting score of a post. Second, they are flexible and general, being able to fade the effects of old examples, select a subset of features/examples, and handle special cases (i.e., when only one or two aspects are prominent). Third, they are scalable and adaptive to newly arrived examples; for example, one of our algorithms (LIP-KIMAA) has sub-linear complexity in both time and space. On the Stack Overflow data set with more than 3 million posts, LIP-KIMAA can build and update our model in seconds, while straightforward approaches would take hours. A sketch of the incremental-update idea is given after this list.

In summary, the main contributions of this paper include:

- A family of novel algorithms for predicting the voting scores of questions/answers on CQA sites. The algorithms handle the three aspects of non-linearity, coupling, and dynamics; they can also fade the effects of old posts and learn a sparse set of features/examples. Proofs and analysis establish their optimality, correctness, and computational efficiency, as well as the intrinsic relationships among the different algorithms.

- Extensive empirical evaluations demonstrating the effectiveness and efficiency of our algorithms. For example, compared with alternative choices, one of our proposed algorithms (LIP-KIMAA) (1) leads to up to a 35.8% improvement in prediction accuracy, and (2) is up to 390x faster while enjoying sub-linear scalability.
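The following is a hedged sketch of how a model can adapt to newly arrived posts and fade old examples: recursive least squares with a forgetting factor, updated via the matrix inversion lemma (in the spirit of references [5] and [11]) so that each new example costs far less than re-solving from scratch. The forgetting factor and feature dimensionality below are illustrative, not values from the paper.

```python
# Hedged sketch of the "dynamics" aspect: per-post incremental updates of a
# linear ridge model using recursive least squares with forgetting. This is
# an illustration of the general technique, not the paper's LIP algorithms.
import numpy as np

class OnlineRidge:
    def __init__(self, dim, lam=1.0, forget=0.99):
        self.P = np.eye(dim) / lam    # running inverse of the regularized Gram matrix
        self.w = np.zeros(dim)        # current weight vector
        self.forget = forget          # <1 fades the influence of old posts

    def update(self, x, y):
        """Fold one new (features, voting score) pair into the model."""
        Px = self.P @ x
        k = Px / (self.forget + x @ Px)        # gain vector
        self.w += k * (y - x @ self.w)         # correct by the prediction error
        self.P = (self.P - np.outer(k, Px)) / self.forget

    def predict(self, x):
        return x @ self.w

model = OnlineRidge(dim=5)
for x, y in zip(np.random.rand(1000, 5), np.random.rand(1000)):
    model.update(x, y)                         # posts arrive one at a time
print(model.predict(np.random.rand(5)))
```

Each update costs O(d^2) for d features, which is what makes this style of model practical when posts stream in continuously.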

EXISTING SYSTEM:

In this section, we briefly review related work on mining CQA sites and mining stream data.

Mining CQA sites: There is a large body of existing work on mining CQA sites. For example, Li et al. aim to predict question quality, defined as the combination of user attention, answer attempts, and the arrival speed of the best answer. Ravi et al. also study question quality, combining voting score and pageviews. Jeon et al. and Suryanto et al. evaluate the usefulness of answer quality and incorporate it to improve retrieval performance. To predict the quality of both questions and answers, Agichtein et al. develop a graph-based model to capture the relationships among users, Li et al. adopt the co-training approach to employ both question features and answer features, and Bian et al. [24] propose to propagate labels through the user-question-answer graph so as to tackle the sparsity problem where only a small number of questions/answers are labeled. Recently, Anderson et al. propose to predict the long-lasting value (i.e., the pageviews) of a question and its answers. Dror et al. aim to predict whether a question will be answered or not. How to predict the answer that the questioner will probably choose as the accepted answer has also been well studied. Overall, our work differs from this existing work at the methodology level: while most of the existing work treats the prediction problem as a single, and/or linear, and/or static problem, our algorithms jointly address the non-linearity, coupling, and dynamics aspects.

CONCLUSIONS

In this article, we have proposed a family of algorithms to comprehensively and efficiently predict the voting scores of questions/answers on CQA sites. In particular, some of the proposed algorithms (LIP-KIM, LIP-KIMA, and LIP-KIMAA) capture the three key aspects (non-linearity, coupling, and dynamics) that affect the voting score of a post, while others handle the special cases where only a fraction of the three aspects are prominent. In terms of computational efficiency, some algorithms (LIP-IM, LIP-IMF, LIP-KIA, and LIP-KIMAA) enjoy linear, sub-linear, or even constant scalability. The proposed algorithms are also able to fade the effects of old examples (LIP-IMF) and to select a subset of features/examples (LIP-MS and LIP-KMS). We analyze our algorithms in terms of optimality, correctness, and complexity, and reveal the intrinsic relationships among the different algorithms. We conduct extensive experimental evaluations on two real data sets to demonstrate the effectiveness and efficiency of our approaches.
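One generic way sub-linear scaling of a kernel model can be achieved is via a low-rank Nyström approximation of the kernel matrix (in the spirit of reference [9]): the regression is then solved over a small set of landmark posts instead of all posts. The sketch below is an illustrative assumption of that technique, not the paper's specific construction; the landmark count, kernel, and regularization are placeholders.

```python
# Hedged sketch: Nystrom-approximated kernel ridge regression, so the cost is
# governed by m landmark posts rather than all n posts.
import numpy as np

def rbf(A, B, gamma=0.1):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def nystrom_features(X, landmarks, gamma=0.1):
    """Map each post to an m-dimensional approximate kernel feature vector."""
    W = rbf(landmarks, landmarks, gamma)          # m x m landmark kernel
    C = rbf(X, landmarks, gamma)                  # n x m cross kernel
    vals, vecs = np.linalg.eigh(W)                # W^{-1/2} via eigendecomposition
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ vecs.T
    return C @ inv_sqrt                           # n x m approximate features

X = np.random.rand(5000, 5)                       # early features of 5000 posts
y = np.random.rand(5000)                          # voting scores
landmarks = X[np.random.choice(len(X), 100, replace=False)]
Z = nystrom_features(X, landmarks)
# ordinary ridge regression in the 100-dimensional landmark feature space
w = np.linalg.solve(Z.T @ Z + 1.0 * np.eye(Z.shape[1]), Z.T @ y)
print(nystrom_features(X[:3], landmarks) @ w)     # predicted scores for 3 posts
```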

REFERENCES

[1] T. Osbourn, "Getting the most out of the web," IEEE Software, vol. 28, no. 1, pp. 96–96, 2011.

[2] Y. Yao, H. Tong, T. Xie, L. Akoglu, F. Xu, and J. Lu, "Joint voting prediction for questions and answers in CQA," in ASONAM, 2014.

[3] ——, "Want a good answer? Ask a good question first!" arXiv preprint arXiv:1311.6876, 2013.

[4] C. Saunders, A. Gammerman, and V. Vovk, "Ridge regression learning algorithm in dual variables," in ICML, 1998, pp. 515–521.

[5] S. S. Haykin, Adaptive filter theory, 2005.

[6] Y. Engel, S. Mannor, and R. Meir, "The kernel recursive least-squares algorithm," IEEE Transactions on Signal Processing, vol. 52, no. 8, pp. 2275–2285, 2004.

[7] N. Aronszajn, "Theory of reproducing kernels," Transactions of the American Mathematical Society, vol. 68, no. 3, pp. 337–404, 1950.

[8] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[9] P. Drineas and M. W. Mahoney, "On the Nyström method for approximating a Gram matrix for improved kernel-based learning," The Journal of Machine Learning Research, vol. 6, pp. 2153–2175, 2005.

[10] G. Golub and C. Van Loan, “Matrix computations,” 1996.

[11] W. W. Piegorsch and G. Casella, “Inverting a sum of matrices,” SIAMReview, vol. 32, no. 3, pp. 470–470, 1990.

[12] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.