LARGE-SCALE LOCATION PREDICTION FOR WEB PAGES
Location information of Web pages plays an important role in location-sensitive tasks such as Web search ranking for location-sensitive queries. However, such information is usually ambiguous, incomplete, or even missing, which raises the problem of location prediction for Web pages. Meanwhile, Web pages are massive and often noisy, which poses challenges to the majority of existing algorithms for location prediction. In this paper, we propose a novel and scalable location prediction framework for Web pages based on the query-URL click graph. In particular, we introduce the concept of term location vectors to capture location distributions for all terms and develop an automatic approach to learn the importance of each term location vector for location prediction. Empirical results on a large URL set demonstrate that the proposed framework significantly improves location prediction accuracy compared with various representative baselines. We further provide a principled way to incorporate the proposed framework into the search ranking task, and experimental results on a commercial search engine show that the proposed method remarkably boosts the ranking performance for location-sensitive queries.
Wang et al. identify three different directions of location prediction for Web pages: provider location, content location, and serving location. This paper focuses on predicting content locations of Web pages. Generally, a large amount of research on Web page location prediction has used Named Entity Recognition (NER) to extract location entities, applied location knowledge bases to verify the extracted locations, and further explored machine learning models to improve location extraction and reduce ambiguity.
Amitay et al. focus on reducing the ambiguity of location terms that also have non-location meanings. They extract all locations in Web pages, disambiguate the location terms, and make a location decision for the page. Wang et al. identify the locations of Web content based on a gazetteer knowledge base and use a top-down dominant location detection algorithm, which traverses the geographical tree starting from the root and continues to examine child nodes until certain conditions are satisfied for the current node. However, the compilation of such gazetteers is sometimes mentioned as a bottleneck in the design of NER systems. Qin et al. further study the influence of prior knowledge on location extraction. They leverage a location prior and a linguistic context prior to infer a set of possible locations, formulate location prediction as a ranking problem over the candidates, and iteratively update the location set until convergence. This work utilizes rich semantic information and improves the performance of location extraction significantly. However, these works extract locations based on location terms and their contextual information while ignoring non-location terms that may also carry location information.
In this paper, we explore both location and non-location terms for location prediction from a new perspective and propose a novel location prediction framework based on the query-URL click graph (Section 2), which tackles all three challenges simultaneously. We introduce the concept of term location vectors to model the relations between terms and locations via the query-URL click graph (details in Section 3). Each term is represented as a vector over all locations, and the weight of a location in the vector corresponds to the confidence that the term is relevant to that location. Such term location vectors not only model both the accurate locations and the related locations for location terms, but also capture the hidden location information of non-location terms. Thus they play an important role in reducing the ambiguity in location prediction for Web pages. To mitigate the noise from the long and noisy content of Web pages, we also propose to propagate the query terms through the query-URL click graph to represent the content of URLs (details in Section 3). This new representation naturally bridges the semantic gap between queries and Web page content, and encodes rich contextual information from queries and users' click behaviors for location prediction.
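The idea of term location vectors and click-text propagation can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the framework's actual implementation: the function names, the toy click data, the `query_location` lookup (standing in for a query location extractor), and the click-count weighting are all hypothetical, and the term weights default to uniform rather than being learned.

```python
from collections import defaultdict

def build_term_location_vectors(clicks, query_location):
    """Aggregate click counts per (term, location) and normalize per term.

    clicks: iterable of (query, url, click_count) edges of the click graph.
    query_location: dict mapping a query to a location label (assumed given).
    """
    raw = defaultdict(lambda: defaultdict(float))
    for query, _url, count in clicks:
        loc = query_location.get(query)
        if loc is None:
            continue  # queries without a known location contribute nothing
        for term in set(query.split()):
            raw[term][loc] += count
    vectors = {}
    for term, loc_counts in raw.items():
        total = sum(loc_counts.values())
        vectors[term] = {loc: c / total for loc, c in loc_counts.items()}
    return vectors

def build_url_click_text(clicks):
    """Propagate query terms to clicked URLs, weighted by click count."""
    text = defaultdict(lambda: defaultdict(float))
    for query, url, count in clicks:
        for term in query.split():
            text[url][term] += count
    return text

def predict_location(url, click_text, vectors, term_weight=None):
    """Score each location by a weighted sum of the URL's term location vectors."""
    scores = defaultdict(float)
    for term, tf in click_text.get(url, {}).items():
        # Uniform weights are an assumption; the paper learns them automatically.
        w = 1.0 if term_weight is None else term_weight.get(term, 0.0)
        for loc, conf in vectors.get(term, {}).items():
            scores[loc] += w * tf * conf
    if not scores:
        return None
    return max(scores, key=scores.get)

# Hypothetical toy click graph.
clicks = [
    ("pizza chicago", "a.example/deep-dish", 10),
    ("deep dish pizza", "a.example/deep-dish", 5),
    ("deep dish chicago", "b.example/guide", 4),
    ("pizza new york", "c.example/slice", 8),
]
query_location = {
    "pizza chicago": "Chicago",
    "deep dish chicago": "Chicago",
    "pizza new york": "New York",
}

vectors = build_term_location_vectors(clicks, query_location)
click_text = build_url_click_text(clicks)
prediction = predict_location("a.example/deep-dish", click_text, vectors)
```

In this toy data, the non-location term "deep" receives all of its weight on Chicago because it only co-occurs with Chicago-located queries, while "pizza" is split between the two cities. This is the sense in which non-location terms can carry hidden location information.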
In this paper, we propose an efficient and scalable algorithm for Web page location prediction by leveraging the query-URL click graph. We model each term as a distribution over all locations to capture the location information of both location and non-location terms, and further develop an approach to automatically learn a weight for each term location vector that captures its impact on location prediction. The proposed framework introduces the novel concept of term location vectors, which enables us to incorporate rich contextual information and explore various types of URL content. In addition to its scalability and flexibility, the whole framework is based on the query-URL click graph and needs no extra human labels. Experimental results from a commercial search engine demonstrate that (1) the proposed framework can accurately predict locations of Web pages; and (2) the proposed framework can be incorporated into a novel location-boosting framework to significantly improve the search relevance performance for location-sensitive queries. There are several interesting directions for further investigation.
First, this framework can be further improved by introducing geographical prior knowledge, temporal awareness, and better query location extraction techniques, and by building term location vectors for each sense of a term. In addition to the weighted click text for URLs, other heuristic weights such as term frequency-inverse document frequency (tf-idf) can be considered to filter out the noise of URL page content and weight the terms. Anchor text for URLs can also be incorporated to better represent URL page content.
Second, the learned term location vectors can also be extended to predict query locations. Like the terms of URL pages, queries contain both location and non-location terms, and both may carry useful clues for location prediction. Unlike traditional methods, which focus on the query words, we can extract location information from both location and non-location terms to predict query locations.
Finally, the proposed location boosting model shows a promising direction for incorporating location features into location-sensitive applications. While the learning-to-rank model buries the distance feature due to its low coverage, the location boosting model considers the distance feature and the base relevance at the same time for the final ranking. Similar distance functions can be defined to capture other important dimensions in ranking, including popularity, recency, and personalization, and new boosting models can be developed accordingly to balance the distance function and the base relevance. The results can be further improved by training query-specific boosting parameters. In addition, when to trigger the location boosting remains an open question for further exploration.
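To make the idea of combining a distance feature with the base relevance concrete, the following Python sketch reranks results with a simple additive boost. The functional form (an exponential decay of distance), the parameters `alpha` and `scale_km`, and the toy results are assumptions chosen for illustration only; they are not the boosting function used in this work.

```python
import math

def boosted_score(base_relevance, distance_km, alpha=1.0, scale_km=50.0):
    """Combine base relevance with a distance-based boost (hypothetical form).

    Results with no predicted location (distance_km is None) keep their base
    score, so low coverage of the distance feature does not penalize them.
    """
    if distance_km is None:
        return base_relevance
    return base_relevance + alpha * math.exp(-distance_km / scale_km)

def rerank(results, alpha=1.0, scale_km=50.0):
    """Sort results by boosted score, highest first."""
    return sorted(
        results,
        key=lambda r: boosted_score(r["base"], r["distance_km"], alpha, scale_km),
        reverse=True,
    )

# Hypothetical results for one location-sensitive query.
results = [
    {"url": "far", "base": 2.0, "distance_km": 800.0},
    {"url": "near", "base": 1.8, "distance_km": 5.0},
    {"url": "unknown", "base": 1.9, "distance_km": None},
]
ranked_urls = [r["url"] for r in rerank(results)]
```

In this sketch `alpha` controls how strongly the distance term can override the base relevance. Making `alpha` depend on the query would correspond to the query-specific boosting parameters mentioned above.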
E. Amitay, N. Har’El, R. Sivan, and A. Soffer. Web-a-where: geotagging web content. In ACM SIGIR conference on Research and Development in Information Retrieval, 2004.
L. Backstrom, J. Kleinberg, R. Kumar, and J. Novak. Spatial variation in search engine queries. In International conference on World Wide Web, 2008.
C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, pages 89–96, 2005.
Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li. Learning to rank: from pairwise approach to listwise approach. In International Conference on Machine learning, pages 129–136, 2007.
Z. Cheng, J. Caverlee, and K. Lee. You are where you tweet: A content-based approach to geo-locating Twitter users. In ACM International Conference on Information and Knowledge Management, 2010.
J. Cho and S. Roy. Impact of search engines on page popularity. In International conference on World Wide Web, 2004.
H. Deng, I. King, and M. R. Lyu. Entropy-biased models for query representation on the click graph. In ACM SIGIR conference on Research and Development in Information Retrieval, 2009.
A. Dong, Y. Chang, Z. Zheng, G. Mishne, J. Bai, R. Zhang, K. Buchner, C. Liao, and F. Diaz. Towards recency ranking in web search. In ACM international conference on Web search and data mining, 2010.
J. Eisenstein, B. O’Connor, N. A. Smith, and E. P. Xing. A latent variable model for geographic lexical variation. In Conference on Empirical Methods in Natural Language Processing, 2010.
J. R. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Annual Meeting on Association for Computational Linguistics, 2005.