Using search engine principles to explain what a crawler (spider) is
It has been explained that the crawler cannot judge the quality of a page. Strictly speaking, the crawler does not even extract links; it is only a TCP/IP program. But link analysis has to be done somewhere, otherwise the crawler could never fetch a new page. To be precise, link analysis is handled by a "dispatcher". The crawler fetches a page and hands it to the dispatcher. The dispatcher analyzes the page, puts every link it finds into the URL library, and returns the links it considers important to the crawler, which then fetches those pages. At the same time, the crawler stores the fetched page in the Page library. If a page is already present in the Page library and the URL library, it is not crawled again.
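The loop described above can be sketched roughly as follows. This is only a minimal illustration, not how any real search engine is implemented: the in-memory `FAKE_WEB`, the regular expression used to find links, and the names `crawler_fetch`, `dispatcher_analyze`, `crawl` and `is_important` are all assumptions made up for the example.

```python
import re
from collections import deque

# Hypothetical in-memory "web", so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a> '
                            '<a href="http://example.com/c">C</a>',
    "http://example.com/b": "no links here",
    "http://example.com/c": '<a href="http://example.com/a">A</a>',
}

LINK_RE = re.compile(r'href="([^"]+)"')


def crawler_fetch(url):
    """The crawler only downloads; it neither parses nor judges the page."""
    return FAKE_WEB.get(url, "")


def dispatcher_analyze(html, url_library, is_important):
    """Record every newly found link; return only the important new ones."""
    to_fetch = []
    for link in LINK_RE.findall(html):
        if link in url_library:
            continue  # already known, so it is not scheduled again
        url_library.add(link)
        if is_important(link):
            to_fetch.append(link)
    return to_fetch


def crawl(seed, is_important=lambda url: True):
    url_library = {seed}   # every URL the dispatcher has seen
    page_library = {}      # every page the crawler has downloaded
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in page_library:
            continue  # never download the same page twice
        html = crawler_fetch(url)
        page_library[url] = html
        queue.extend(dispatcher_analyze(html, url_library, is_important))
    return url_library, page_library
```

Calling `crawl("http://example.com/")` visits each of the four sample pages exactly once. Passing a stricter `is_important` function leaves some URLs recorded in the URL library without ever being downloaded, which is the split between "found" and "fetched" that the paragraph describes.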
Treating the crawler as something miraculous has also produced one of the most common sayings supposedly "proven by practice" about love Shanghai, passed around as the wise remark of an experienced person: "the crawler will include original content within seconds".
This article is from 贵族宝贝csdinuan贵族宝贝
Large commercial search engines run many crawlers working together. Each "dispatcher" also exchanges information with a "master controller", which coordinates and assigns the specific work of the various crawlers. If you see several crawlers taking turns to fetch the same page many times within a short period, it usually means the scheduling was not done well.
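One way to picture that coordination is a controller that assigns every URL to exactly one crawler and refuses to hand out the same URL twice. The sketch below is an assumption for illustration only: hashing the host name to pick a crawler is a common partitioning idea, not something the text says any particular search engine does, and `MasterController`, `assign_crawler` and `submit` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url, num_crawlers):
    """Pick a crawler index from the host name, so one host maps to one crawler."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_crawlers


class MasterController:
    """Hypothetical master controller that hands URLs out to crawlers."""

    def __init__(self, num_crawlers):
        self.num_crawlers = num_crawlers
        self.assigned = set()                           # URLs already handed out
        self.queues = [[] for _ in range(num_crawlers)]  # one work queue per crawler

    def submit(self, url):
        """Queue a URL for one crawler; return its index, or None if already assigned."""
        if url in self.assigned:
            return None  # prevents several crawlers fetching the same page
        self.assigned.add(url)
        index = assign_crawler(url, self.num_crawlers)
        self.queues[index].append(url)
        return index
```

With this arrangement, dispatchers that report the same link a second time simply get `None` back, so the repeated fetching described above would only happen if the bookkeeping itself failed.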
Of course, in the eyes of anyone who understands how search engines work, this "practice" is not reliable. If practice is the way to test truth, then what it tests should be a theoretical hypothesis that has first been made reasonably complete. And since the crawler does no content analysis, how could it determine that a page's content is original before the search engine includes it?
Many people even think the crawler does not fetch scraped content, which is more surprising still. The crawler is no prophet: how would it know, before fetching a page, whether that page was scraped? (There is one special case, not considered here: the search engine may refer to a site's overall rate of original content when deciding the crawl priority for the whole site, but that goes relatively deep.)
A search engine has four systems: download, analysis, indexing, and query. These four work largely independently of each other, and judging whether a page is scraped happens in the analysis system. Probably out of concern for the efficiency of checking pages at large scale, duplicate pages are generally deleted only a long time after they have been indexed. That is, whether the search engine includes a page or not is, at the very least, not decided by the quality of the page itself.
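To make the separation of the four systems concrete, here is a toy sketch in which download, analysis, indexing and query are separate steps, and duplicates are removed in a later pass over the index instead of at download time. Everything in it is an assumption for illustration: the sample `PAGES`, the exact-match SHA-1 fingerprint (real duplicate detection is far more tolerant), and the names `download`, `analyze`, `Index` and `remove_duplicates`.

```python
import hashlib
import re
from collections import defaultdict

# Hypothetical pages; the second is a scraped copy of the first.
PAGES = {
    "http://a.example/1": "Crawlers only download pages",
    "http://b.example/copy": "Crawlers only download pages",
    "http://c.example/2": "The dispatcher analyzes links",
}


def download(url):
    """Download system: fetches content and makes no quality judgment."""
    return PAGES[url]


def analyze(text):
    """Analysis system: tokenizes the text and fingerprints its content."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    fingerprint = hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()
    return tokens, fingerprint


class Index:
    """Indexing and query systems over a simple inverted index."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> URLs containing it
        self.fingerprints = {}            # URL -> content fingerprint

    def add(self, url, tokens, fingerprint):
        for token in tokens:
            self.postings[token].add(url)
        self.fingerprints[url] = fingerprint

    def remove_duplicates(self):
        """Later cleanup pass: keep the first page seen for each fingerprint."""
        first_seen = {}
        removed = []
        for url, fingerprint in self.fingerprints.items():
            if fingerprint in first_seen:
                removed.append(url)
            else:
                first_seen[fingerprint] = url
        for url in removed:
            del self.fingerprints[url]
            for urls in self.postings.values():
                urls.discard(url)
        return removed

    def query(self, word):
        return sorted(self.postings.get(word.lower(), set()))
```

If every page in `PAGES` is downloaded, analyzed and added, the scraped copy is searchable at first and only disappears after `remove_duplicates()` runs, which mirrors the point that inclusion happens before any judgment about the page.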
Reprinting is allowed, but please keep the link.
In fact, though, counting things like the "dispatcher" as part of the crawler program cannot be called wrong. It is just a matter of wording: one way of putting it is relatively rigorous, the other relatively loose. Either way, a crawler mostly just downloads; at most, the dispatcher gives it a few extra tricks on top of downloading.