
Getting a Handle on the PageRank Algorithm

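In short, the basic principle of Google Search is this: web pages are ranked ahead of time according to the PageRank algorithm, and at the moment of a search, the pages containing the search term are listed in that precomputed order.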
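
To make the two steps concrete, here is a minimal sketch, not Google's actual implementation. The URLs, scores, and page texts are made-up illustration data, and `search` is a hypothetical helper name. Ranking happens once up front, and a query only filters the already sorted list.

```python
# Hypothetical precomputed PageRank scores (made-up numbers for illustration).
pagerank = {
    "a.example/home": 0.40,
    "b.example/blog": 0.35,
    "c.example/faq": 0.25,
}

# Hypothetical page texts, standing in for a real search index.
page_text = {
    "a.example/home": "tire replacement guide",
    "b.example/blog": "pagerank algorithm explained",
    "c.example/faq": "pagerank and tire questions",
}

# Step 1 (ahead of time): sort every page by its PageRank, highest first.
ranked_pages = sorted(pagerank, key=pagerank.get, reverse=True)


def search(query):
    """Return pages whose text contains the query word, in precomputed rank order."""
    word = query.lower()
    return [url for url in ranked_pages if word in page_text[url].lower().split()]


# Step 2 (at query time): filter only; the order was already decided in step 1.
results = search("pagerank")
```

A real engine combines many more signals than this, so treat the sketch only as an illustration of "sort first, filter at query time".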

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is entered)." However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say in the top ten of results). Indeed, we want our notion of "relevant" to only include the very best documents, even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism about the use of more hypertextual information for making relevance judgments and other applications [Marchiori 97], [Spertus 97], [Weiss 96], [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text.

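The motivation behind the PageRank research was simple: improve the quality of search engines. At the time, the amount of information kept growing, while search results were increasingly full of junk and inaccurate. Because people mostly look only at the top ten results on the first page, showing accurate results mattered more than anything else.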

Two concepts to keep in mind going in: backlinks, and a measure of which pages are important.
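
As a small illustration of the backlink idea, the sketch below inverts a map of outgoing links into a map of backlinks. The three-page graph and the name `build_backlinks` are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical outgoing-link map: page -> pages it links to.
out_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}


def build_backlinks(out_links):
    """Invert the link map: for each page, list the pages that point to it."""
    backlinks = defaultdict(list)
    for source, targets in out_links.items():
        for target in targets:
            backlinks[target].append(source)
    return dict(backlinks)


backlinks = build_backlinks(out_links)
# Page C is pointed to by both A and B, so it has the most backlinks here.
```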



We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

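PR(A) = (1-d)/N + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Here N is the total number of web pages. The quoted text does not define it, but dividing (1-d) by N is what lets the PageRanks sum to one.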


Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
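
A tiny worked example can make the "sums to one" claim tangible. Assume a hypothetical web of only two pages, A and B, that link only to each other, so N = 2 and C(A) = C(B) = 1. This setup is an illustration, not something from the quoted paper.

```latex
\begin{aligned}
PR(A) &= \frac{1-d}{2} + d\,\frac{PR(B)}{C(B)} = \frac{1-d}{2} + d\,PR(B),\\
PR(B) &= \frac{1-d}{2} + d\,\frac{PR(A)}{C(A)} = \frac{1-d}{2} + d\,PR(A).
\end{aligned}
% Subtracting the two lines gives (1+d)\,(PR(A)-PR(B)) = 0, so PR(A) = PR(B) = x.
\begin{aligned}
x = \frac{1-d}{2} + d\,x
\;\Longrightarrow\; (1-d)\,x = \frac{1-d}{2}
\;\Longrightarrow\; x = \frac{1}{2},
\qquad PR(A) + PR(B) = 1.
\end{aligned}
```

With d = 0.85 the check is 0.075 + 0.85 × 0.5 = 0.5 for each page, and the two values add up to one.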


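The PageRanks of the other pages are obtained the same way. Because each page's value depends on the values of the pages linking to it, the formula is applied recursively, over and over.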
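
One common way to carry out this repeated computation is plain iteration: start every page at 1/N and reapply the formula until the values stop changing much. The sketch below does that for a fixed number of rounds. The three-page graph, the function name `compute_pagerank`, and the round count are assumptions for illustration, and the code assumes every page has at least one outgoing link and links only to pages in the map.

```python
def compute_pagerank(out_links, d=0.85, iterations=100):
    """Repeatedly apply PR(A) = (1-d)/N + d * sum(PR(T)/C(T)) over all pages."""
    pages = list(out_links)
    n = len(pages)
    # Start from a uniform distribution so the values sum to one.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # Sum PR(T)/C(T) over every page T that links to this page.
            incoming = sum(
                ranks[source] / len(targets)
                for source, targets in out_links.items()
                if page in targets
            )
            new_ranks[page] = (1 - d) / n + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = compute_pagerank(links)
total = sum(ranks.values())  # stays at one (up to rounding) in every round
```

In this graph C collects links from both A and B, so it ends up ranked highest, and A follows closely because it receives all of C's rank.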






















