3. Indexing engine
3.1 Index
Once the web pages have been crawled, the spider sends the collected information to the indexing engine. Indexing is carried out in full text: every word on a page, and more generally its HTML code, is taken into account.
The indexing systems thus record every word found in the text of a page, together with its position within the page. Some engines, however, limit how much of a document they index. For many years, for example, Google indexed only the first 101 kilobytes of a page (which was nonetheless a substantial amount of content). This limit no longer applies today. Other engines choose what to index according to document format (Excel, PowerPoint, PDF...).
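The idea of recording each word together with its position can be illustrated with a small positional inverted index. The sketch below is only an illustration and does not describe how any particular engine works: the names TextExtractor, tokenize, build_index and the max_bytes parameter are hypothetical, and for simplicity it indexes only the visible text of a page rather than its full HTML code. The optional max_bytes argument mimics a size cap of the kind mentioned above.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text of a page, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def tokenize(html, max_bytes=None):
    """Return the lowercased words of a page, in order of appearance.

    max_bytes (hypothetical parameter) truncates the page before parsing,
    imitating an engine that indexes only the start of a document.
    """
    raw = html.encode("utf-8")
    if max_bytes is not None:
        raw = raw[:max_bytes]
    parser = TextExtractor()
    parser.feed(raw.decode("utf-8", errors="ignore"))
    parser.close()
    text = " ".join(parser.chunks)
    return [word.lower() for word in re.findall(r"\w+", text)]


def build_index(pages, max_bytes=None):
    """Map each word to {url: [positions]} for a dict of url -> HTML."""
    index = defaultdict(lambda: defaultdict(list))
    for url, html in pages.items():
        for position, word in enumerate(tokenize(html, max_bytes)):
            index[word][url].append(position)
    return index


# Example with a single made-up page.
pages = {"a.html": "<p>Search engines index words. Engines rank pages.</p>"}
index = build_index(pages)
print(dict(index["engines"]))  # {'a.html': [1, 4]}
```

Storing positions, and not just the presence of a word, is what later allows phrase and proximity queries to be answered from the index alone.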
Finally, as with documentary software...