Google searches are so ingrained in our everyday lives that it’s easy to forget the incredible feat of engineering they represent. Given the sheer amount of information Google stores, the speed and relevance of the results are astonishing, to say the least. The aim of this article is to break down this myth by introducing some of the simple techniques used behind the scenes, and to demonstrate that search engines aren’t actually unfathomable.
For the sake of brevity, I will skip “crawling”, the process of collecting documents from the web (by convention, web pages are also referred to as documents in search engines). I will assume we already possess a large collection of documents, which we now need to search through every time a user makes a query.
It seems obvious that re-reading every document every time a user makes a query is not viable. Instead, what a search engine will do is indexing. Indexing is the process of storing key information related to the documents in such a way that future searches will find it with ease. It can involve clustering similar documents together by assigning a common keyword, or storing numerical values based on properties such as the word count of the document. A simple example of indexing in everyday life can be found in supermarkets, where aisles are indexed by what they contain (meat, biscuits & cereals, milk & yogurt, ...).
When it comes to documents, there are countless indexing techniques, so I’ll keep it simple and introduce two of the most basic yet fundamental ones.
The inverted index is the simplest yet still extremely powerful document indexing method. It consists of creating a vocabulary containing every word from every document in your collection, and recording which words appear in which documents. Carrying out this procedure, you would end up with a table which should look something like this:

word        documents
book        5, 6, 17, 23
about       5, …
computer    2, 5, 17, …

This table tells us, for example, that “book” appears in documents 5, 6, 17 and 23.
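Building such an index is only a few lines of code. Here is a minimal sketch in Python; the document texts are invented for the example, but the document IDs match the ones above.

```python
from collections import defaultdict

# Toy document collection; the texts are made up for this sketch, but the
# document IDs follow the article's example (every document contains "book").
docs = {
    5: "a book about the history of the computer",
    6: "this book covers something else entirely",
    17: "the book next to the computer",
    23: "an old book of maps",
}

# word -> set of document IDs containing that word
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

print(sorted(inverted_index["book"]))  # → [5, 6, 17, 23]
```

A real engine would also normalise the text (strip punctuation, handle plurals, and so on), but the core data structure is exactly this word-to-documents mapping.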
Once all our documents are indexed, we are ready to retrieve lists of (somewhat) sensible documents given a query. For example, say a user searches for “book about the first computer”: we simply find the document which contains the largest number of words from the query. In this case, we can see that document #5 contains “book”, “about”, and “computer”, and will therefore be our first result.
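This word-overlap ranking can be sketched as follows. The toy index below is an assumption built from the article’s example; any real index would of course be much larger.

```python
from collections import Counter

# Toy inverted index (word -> document IDs), consistent with the article's
# example; the entries are assumptions for illustration.
inverted_index = {
    "book": {5, 6, 17, 23},
    "about": {5},
    "computer": {2, 5, 17},
}

def rank(query, index):
    """Rank documents by how many distinct query words they contain."""
    counts = Counter()
    for word in set(query.lower().split()):
        for doc_id in index.get(word, ()):
            counts[doc_id] += 1
    return counts.most_common()

results = rank("book about the first computer", inverted_index)
print(results[0])  # → (5, 3): document #5 matches "book", "about" and "computer"
```

Words absent from the index (here “the” and “first”) simply contribute nothing to any document’s score.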
The idea behind the positional index is that words that are close together are more likely to be related than words that are paragraphs apart. In the previous example, the fact that the three queried words are all present in document #5 does not imply that they are related to each other within the document itself.
To account for this, positional indexing not only records whether a word appears in a document, but additionally stores the positions at which it appears. The positional index table would look something like this:

word        documents: positions
book        5: [12]; 17: [10]; …
about       5: [75]; …
computer    2: [10, 25]; 5: [52]; 17: [12]; …
The bracketed numbers give the positions at which the word appears in a document. For instance, here “computer” appears at positions 10 and 25 in document #2.
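The construction is a small extension of the inverted index: instead of just the document ID, we also record the word offset. A minimal sketch, with an invented one-document collection (note that this code counts positions from 0, while the article counts from 1; the idea is identical):

```python
from collections import defaultdict

# Toy collection; the text is invented for the example.
docs = {
    17: "the book on the computer desk",
}

# word -> {doc_id: [positions]}
positional_index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.lower().split()):
        positional_index[word].setdefault(doc_id, []).append(pos)

print(positional_index["the"][17])       # → [0, 3]
print(positional_index["computer"][17])  # → [4]
```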
Going back to the earlier query containing “book”, “about” and “computer”, we can now see that in document #5, previously our favourite, these words appear at a significant distance from each other: “book” is the 12th word, “computer” the 52nd and “about” the 75th. On the other hand, document #17 only contains “book” and “computer”, but they appear very close together, at positions 10 and 12 respectively. With only one word in between, we can state with fairly high confidence that they are related to each other, making the document more relevant to the query.
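One simple way to turn this intuition into a score is to measure the smallest gap between any pair of occurrences of two words. The sketch below uses the positions quoted above; how the distance then gets combined with the word-overlap count is a design choice left open here.

```python
def min_distance(positions_a, positions_b):
    """Smallest gap (in words) between any occurrence of the two words."""
    return min(abs(a - b) for a in positions_a for b in positions_b)

# Positions taken from the article's example.
doc5 = {"book": [12], "about": [75], "computer": [52]}
doc17 = {"book": [10], "computer": [12]}

print(min_distance(doc5["book"], doc5["computer"]))    # → 40
print(min_distance(doc17["book"], doc17["computer"]))  # → 2
```

A gap of 2 in document #17 versus 40 in document #5 is what lets the engine promote #17 despite it matching fewer query words.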
The topic of search engines is too vast to be thoroughly discussed in one post, but I hope you now have a basic understanding of some of the fundamental methods used for indexing, which are still widely used today and usually constitute the first step toward building any search engine.
There is a great deal more that can be done to improve both performance and result quality. For instance, one concept which I find especially interesting but couldn’t include for brevity is term frequency–inverse document frequency (tf–idf), which measures how much information a word carries. I may cover it in a future article if some interest is shown.
I hope you enjoyed this article and learned something from it. If you’d like more articles on the topic or have any questions, do not hesitate to leave a comment!