Page Ranking Algorithms – A Comparison Part 1

November 29, 2023
32 mins read

The internet is vast, varied, and constantly changing, which makes it increasingly difficult to find specific web pages quickly and efficiently. Users want relevant pages to be easily accessible, but with so much information available, it is hard to sift through it all and extract the relevant data, so a solution is needed to help users overcome these challenges. Moreover, a significant amount of web information is semi-structured, which makes extracting knowledge from this type of data difficult.

Several issues currently affect search engines:

• The abundance problem (99% of the information is of no interest to 99% of the people).
• Limited coverage of the web (many internet sources are hidden behind search interfaces).
• Even the largest crawlers cover less than 18% of all web pages.
• A limited query interface based on keyword-oriented search.
• Limited customization to individual users.

There are three vital components in a search engine: the crawler, the indexer, and the ranking mechanism.

The crawler, also known as a robot or spider, navigates the web and downloads web pages. These pages are passed to an indexing module that parses them and builds an index based on individual keywords, usually organized alphabetically by keyword. When a user enters a query through the search engine's interface, the query engine compares the keywords against the index and returns the URLs of relevant pages to the client.
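As a rough illustration of the indexing step, the sketch below builds a minimal inverted index in Python. The page set and whitespace tokenization are hypothetical simplifications, not how any production indexer works:

from collections import defaultdict

# Hypothetical crawled pages: URL -> page text
pages = {
    "http://example.com/a": "page ranking algorithms compared",
    "http://example.com/b": "ranking web pages with links",
}

# Build an inverted index: keyword -> set of URLs containing it
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Query: return URLs whose pages contain every query keyword
def search(query):
    results = [index.get(w, set()) for w in query.lower().split()]
    return set.intersection(*results) if results else set()

print(search("ranking pages"))  # {'http://example.com/b'}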

Before showing the pages to the client, search engines apply a ranking mechanism so that the most relevant pages appear at the top and the least important ones at the bottom. This makes it easier for the user to navigate the search results. Web mining and ranking mechanisms are therefore crucial for efficient information retrieval.

Web mining: 

Web mining refers to the overall process of discovering potentially useful and previously unknown information or knowledge from web data. 

Web mining subtasks: 

The subtasks of web mining consist of the following phases, as shown in Fig. 1:

• Resource finding: retrieving the intended documents.
• Information selection and preprocessing: selecting and preprocessing the specific information from the retrieved documents.
• Generalization: discovering general patterns within and across websites.
• Analysis: validating and interpreting the mined patterns.

Web mining types: 

Web mining is divided into the following three types, as shown in Fig. 2:

Web Content Mining: 

Web content mining is the process of extracting useful information from the contents of web documents. It includes the extraction of structured data from web pages; the identification, matching, and integration of semantically similar data; opinion extraction from online sources; and concept hierarchy, ontology, or knowledge integration. Web content mining is the analog of data mining techniques for relational databases, since we can expect to find similar types of knowledge in the unstructured data residing in web documents. The content data consists of text, images, audio, video, or structured records.

Web Usage Mining:

Web usage mining analyses the transaction data that is logged when users interact with the web. It is sometimes called log mining, because it involves mining the web server logs, which maintain a record of each user's browsing activity. Web servers automatically generate large volumes of data, stored in files referred to as logs, containing information about user profiles, access patterns for pages, and so on. The world's largest portals, such as Yahoo and MSN, need many insights from the behavior of their users' web visits. Web usage mining collects data from web log records to discover users' access patterns for web pages, which can provide information for efficient and effective website management and for understanding user behavior.
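As a small illustration of working with such logs, the sketch below counts page accesses from server log lines in the Apache Common Log Format. The sample lines are made up, and real usage-mining pipelines do far more (sessionization, path analysis, and so on):

import re
from collections import Counter

# Two made-up log lines in the Common Log Format
log_lines = [
    '10.0.0.1 - alice [29/Nov/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - bob [29/Nov/2023:10:01:00 +0000] "GET /products.html HTTP/1.1" 200 1024',
]

# Extract the requested path from each line
pattern = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

hits = Counter()
for line in log_lines:
    match = pattern.search(line)
    if match:
        hits[match.group(1)] += 1

print(hits.most_common())  # most frequently accessed pages first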

Web Structure Mining: 

Web structure mining focuses on analysis of the link structure of the web, and one of its purposes is to identify the more preferable documents. A typical web graph structure consists of web pages as nodes and hyperlinks as edges connecting related pages. Web pages can also be organized in a tree-structured format, based on the various HTML and XML tags within the page. Technically, web content mining mainly focuses on the inner structure of a document, while web structure mining tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, web structure mining categorizes web pages and generates information such as the similarity and relationships between different websites. The goal of web structure mining is to create a structural summary of the website and its web pages.

LINK ANALYSIS ALGORITHMS

This web mining technique exploits the additional information provided by hyperlinks, through which different documents are connected. We can view the web as a directed labeled graph whose nodes are the documents or pages and whose edges are the hyperlinks between them. This directed graph structure is known as a web graph.
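For the examples in this article, such a web graph can be represented as a plain adjacency list. The sketch below is in Python, with a hypothetical four-page graph chosen to match the P, Q, R example used later:

# A hypothetical web graph as an adjacency list:
# each page maps to the list of pages it links to (its outlinks).
web_graph = {
    "P": [],               # P has no outgoing links
    "Q": ["P", "R"],       # Q has 2 outlinks
    "R": ["P", "Q", "S"],  # R has 3 outlinks
    "S": ["Q"],
}

# Inlinks can be derived by inverting the adjacency list
inlinks = {page: [] for page in web_graph}
for page, outs in web_graph.items():
    for target in outs:
        inlinks[target].append(page)

print(inlinks["P"])  # ['Q', 'R'] -- the pages linking to P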

Several algorithms have been proposed based on link analysis. Three important ones, PageRank [2], Weighted PageRank [3], and HITS (Hyperlink Induced Topic Search) [4], are discussed below.

A.  PageRank Algorithm

PageRank (PR) is an algorithm used by Google Search to rank websites in its search engine results. It is not the only algorithm Google uses to order search results, but it was the first algorithm the company used, and it is the best known. PageRank was named after Larry Page [5], one of the founders of Google. PageRank is a way of measuring the importance of website pages and is considered the basis for all modern search engines. The underlying assumption is that more important websites are likely to receive more links from other websites.

According to Google, PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. It ranks pages based on the number of backlinks pointing to them, assigning each page a total PageRank based on the PageRanks of the backlinks pointing to it. The links to a page can be categorized into the following types:

• Inbound links: links into the given page from other pages and other sites.
• Outbound links: links from the given page to pages on the same site or other sites.
• Dangling links: links that point to a page with no outgoing links.

The PageRank of a web page is calculated as a sum of the PageRanks of all pages linking to it (its incoming links), divided by the number of outlinks on each of those pages (its outgoing links):

PR(A) = (1-d) + d(PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Where:
PR(A) is the PageRank of page A.
PR(Ti) is the PageRank of page Ti, which links to page A.
C(Ti) is the number of outbound links on page Ti.
d is a damping factor that can be set between 0 and 1 (it models the probability that a user keeps clicking through links) and is usually set to 0.85.
T1, ..., Tn are the pages that link to page A.

Following is a simplified example of the PR algorithm. 
Consider the web graph shown in Fig. 4 

Page P is referenced by pages Q and R, where Q has 2 outlinks and R has 3 outlinks. The PageRank value of page P is then given as:

PR(P)=1-d + d(PR(Q)/2 + PR(R)/3)
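As a quick worked check, assuming the usual d = 0.85 and illustrative starting values PR(Q) = PR(R) = 1:

PR(P) = 0.15 + 0.85 × (1/2 + 1/3) ≈ 0.858

This is only the first iteration; in practice the values on the right-hand side are themselves updated repeatedly, as described next.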

The PageRank algorithm does not rank the whole website; the rank is determined for each page individually. Furthermore, the PageRank of page A is recursively defined by the PageRanks of the pages that link to page A.

Iterative Method of Page Rank 

In iterative calculation, each page is assigned a starting PageRank value of 1. These rank values are iteratively substituted into the PageRank equations to find the final values. In general, many iterations may be needed for the page rank values to converge.

The PageRank algorithm can be iteratively applied as: 

1) Initially, set the PageRank of all web pages to one.
2) Calculate the page ranks of all pages using the above formula.
3) Repeat step 2 until the values of two consecutive iterations match.
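Below is a minimal sketch of this iterative method in Python. The damping factor, convergence tolerance, and iteration cap are illustrative assumptions; production implementations handle dangling links and normalization more carefully:

def pagerank(graph, d=0.85, tol=1e-6, max_iter=100):
    # Iterative PageRank over an adjacency list {page: [outlinks]}.
    ranks = {page: 1.0 for page in graph}  # step 1: start every rank at 1
    for _ in range(max_iter):
        new_ranks = {}
        for page in graph:
            # Sum PR(Ti)/C(Ti) over every page Ti that links to this page
            incoming = sum(
                ranks[src] / len(outs)
                for src, outs in graph.items()
                if page in outs
            )
            new_ranks[page] = (1 - d) + d * incoming  # step 2: apply the formula
        # Step 3: stop when two consecutive iterations match (within tolerance)
        if all(abs(new_ranks[p] - ranks[p]) < tol for p in graph):
            return new_ranks
        ranks = new_ranks
    return ranks

print(pagerank(web_graph))  # web_graph: the adjacency list from the earlier sketch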

Advantages:

• Since it precomputes the rank scores, it takes less time at query time and hence is fast.
• It is more feasible, as it computes rank scores at indexing time rather than at query time.
• It returns important pages, since rank is calculated based on the popularity of a page.

Disadvantages: 

• The main disadvantage is that it favors older pages: a new page, even a very good one, will not have many links unless it is part of an existing website.
• The relevancy of the resulting pages to the user query is very low, because the algorithm does not consider the content of the web page.
• Dangling links are a problem; they occur when a page contains a link whose target page has no outgoing links.
• It can lead to the rank sink problem, which occurs when pages in a network get caught in infinite link cycles.
• Dead ends are possible, i.e., pages with no outgoing links.
• Another problem in PageRank is spider traps: a group of pages is a spider trap if there are no links from within the group to pages outside the group (see the sketch after this list).
• Circular references within a website will reduce the front page's PageRank.
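To make the spider trap problem concrete, the following sketch reuses the pagerank function defined above on a hypothetical three-page graph in which X and Y link only to each other. The damping factor is what keeps the trap from absorbing all of the rank:

# A spider trap: X and Y link only to each other; A links into the trap.
trap_graph = {
    "A": ["X"],
    "X": ["Y"],
    "Y": ["X"],
}

# With the usual damping (d = 0.85), page A still keeps some rank.
print(pagerank(trap_graph, d=0.85))
# With d close to 1, nearly all rank accumulates inside the X-Y cycle.
print(pagerank(trap_graph, d=0.99, max_iter=1000))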
