How search engines work

Some people don't care how things work - they just want them to work. Usually, though, a little knowledge can help us to use things better, and get more from them. This is certainly the case with search engines. If you want to get better results from search engines, the first step is to understand how they work.

Gosh, those searches are fast

The term 'search engine' is somewhat misleading. You'd be forgiven for assuming that, the moment you type in a search phrase and click 'go,' your favourite search engine is off hunting around the Web. Since the Web holds in excess of two billion pages (and millions are added each day) that isn't possible - at least in real time. So how do search engines deliver results within seconds?

Simple: search engines already hold the data that you're searching for. But how did it get there?

Directories and spiders

Since the term 'search engine' is used to describe all Web-based search tools, it would be forgivable to assume that they all work in the same way. They don't. As an extreme example, all entries in Yahoo were once entered, edited, verified and placed into carefully structured categories by humans. Even today, this manual process makes up much of Yahoo's database of Web sites.

Using only humans to index something that grows as fast as the Web is a severely limited process - even armies of people can't keep up with the growth of the Web. Plus, humans aren't going to have time to index a whole Web site - perhaps they'd add just the title, a description and some keywords to the index. So a much faster, and more thorough, process was needed, even if it was less discriminating about the type of content indexed. Enter the spider.

Spiders are 'virtual robots' (used by search engines) which roam the Web, indexing Web sites. The process takes place quickly, constantly and automatically.

The word spider is very apt: the spider almost literally crawls all over a Web page, looking for text. It even examines text that is hidden from humans.

When spiders visit a Web page, they record the words on that page (saving them to the search engine database) and then follow any links on that page to other pages, which are themselves indexed - and so on. Search engines can only read text (though hidden text, including meta tags, can also be presented to search engines as part of the same page). Search engines can't understand graphics or Flash movies: they can register that they exist, but they can't use them as searchable data, in the same way they can with text.
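The crawl described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the PAGES dictionary is a made-up, in-memory stand-in for the Web (a real spider would fetch pages over the network), and the names PageReader and crawl are invented for this example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "Web": URL -> HTML. A real spider would fetch pages over HTTP.
PAGES = {
    "/home": '<title>Golf Shop</title><p>Golf balls for sale.</p><a href="/clubs">Clubs</a>',
    "/clubs": '<p>Golf clubs and caddies.</p><a href="/home">Home</a>',
    "/orphan": "<p>No page links here.</p>",
}


class PageReader(HTMLParser):
    """Collects the words and the outgoing links found on one page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Remember the target of every <a href="..."> link.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Record each word in lower case, without trailing punctuation.
        for raw in data.split():
            word = raw.strip(".,").lower()
            if word:
                self.words.append(word)


def crawl(start):
    """Visit pages breadth-first from `start`; return {url: [words]}."""
    index = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in index or url not in PAGES:
            continue  # already indexed, or the link leads nowhere
        reader = PageReader()
        reader.feed(PAGES[url])
        index[url] = reader.words
        queue.extend(reader.links)  # follow the links found on this page
    return index


index = crawl("/home")
```

Starting from "/home", the sketch indexes "/home" and "/clubs" but never reaches "/orphan", because no page links to it.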

Getting found

Search engines can't see a site if they are not told about it. If you create a site, and don't submit it to any search engines, and no other sites have links to it, then it won't get listed on a search engine - however much time you spend tuning the keywords in your meta tags. It's invisible.

Getting listed isn't too complicated, even for free - though many search engines no longer let you submit sites free of charge (they want to steer people to their chargeable services). Fortunately, some search engines share data, and many draw data from the Open Directory Project, which accepts free submissions (one page per site). The Open Directory Project is managed by humans, who do a good and fast job, it has to be said. Much of the work is done by armies of volunteer subject-matter enthusiasts and experts, which avoids the financial issues of employing so many people. Submissions are vetted before being accepted. Every so often, major search engines supplement their databases with information from the Open Directory Project.

You can also get listed just by getting a link on a site which is already listed - as the next time that site is reindexed, the spider will follow the link to your site, and then index it.

What do spiders read on your Web page?

Many people think that meta tag 'keywords' are the key to search engine success. In fact, few search engines read these keywords now, and those that do only look at them in the context of the page content itself.

The use of meta tag keywords has been subverted, and search engines now tend to ignore them, because:
• lots of people crammed meta tags with irrelevant keywords in an effort to boost their search engine rankings.
• many other people don't include them in their site.
• people don't implement them properly.

Instead, the content of the page itself is spidered (though many common words - such as 'it' and 'the' - may be ignored to save processing time and storage space). This gives results which are more representative of what is actually on the page. (Read our associated briefings to find out how to optimise your pages for search engines or how to avoid things which search engines don't like.)
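As a rough illustration of indexing page content while skipping common words, here is a minimal Python sketch. The STOP_WORDS list, the example pages and the function names are all assumptions made for this example; real search engines use their own lists and far larger data structures.

```python
import re
from collections import defaultdict

# Illustrative list only; each search engine decides which words to ignore.
STOP_WORDS = {"it", "the", "a", "an", "and", "of", "in", "to", "is"}


def tokenize(text):
    """Lower-case the text, split it into words and drop the common ones."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_index(pages):
    """Map each remaining word to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


# Hypothetical pages, keyed by made-up URLs.
pages = {
    "/a": "The lowest price golf balls in the UK",
    "/b": "It is a tennis shop",
}
index = build_index(pages)
```

Looking up index["golf"] then gives the pages containing that word straight away, without re-reading any page, which is why a search can come back so quickly.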

Getting missed

With the Web growing so quickly and containing so many sites, it's impossible for search engines to index all of it. So it's entirely possible for a site never to get listed. Worse, every so often, old, unchanging Web sites get 'junked' by search engines, which generally favour newer content.

Because the odds are against you, it makes sense to optimise your site for search engines and to make sure that it gets listed. Once you're listed, changing your content (or at least the content on key pages) every few months is a valuable exercise in keeping yourself there.

Database query - blindfold

When you use a search engine, you are actually performing a database query, with the most 'relevant' results 'guessed' by the search engine. It does this by examining how those Web pages it has indexed were constructed.

With a single word, you can easily imagine how this works. If you enter the word 'golf', the search engine hunts through its database for sites containing 'golf' and returns these. So how come some sites get to the top of that list and some don't? Is this pure chance? Not really, though search results often seem like something of a lottery.

The search engine looks at where the word appears on a page, how many times it appears and the ratio of this word to the overall text. So, broadly speaking, the more times the word 'golf' appears on a page, and the less other text there is (assuming that the word is in the spiders' preferred areas on a page) then the higher the page will rank.
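The 'ratio of this word to the overall text' can be put into numbers with a very naive Python sketch. The scoring formula, the function names and the example pages are assumptions for illustration only; no search engine publishes its real formula, and this sketch ignores where on the page the word appears.

```python
def keyword_density(text, term):
    """Share of the words in `text` that are exactly `term` (0.0 to 1.0)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


def rank(pages, term):
    """Return the URLs of pages containing `term`, highest density first."""
    scored = [(keyword_density(text, term), url) for url, text in pages.items()]
    return [url for score, url in sorted(scored, reverse=True) if score > 0]


# Hypothetical pages, keyed by made-up URLs.
pages = {
    "/a": "golf golf balls",
    "/b": "golf is a sport played on a course",
    "/c": "tennis rackets",
}
results = rank(pages, "golf")
```

Here "/a" comes first because two of its three words are 'golf', "/b" comes second with one word in eight, and "/c" is left out altogether.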

With some search engines, other factors are also taken into account. For example, if the search word is also found within the meta data on a page (this could be in the hidden keywords, in the 'alternative text' hidden behind images - and especially in the page title) then this can boost the page's rankings. Another factor is the 'link popularity' of that page - how many other pages link to that page, which is sometimes seen to make the page more important than unlinked pages.
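'Link popularity' is easy to picture as a count of inbound links. The Python sketch below assumes we already know which pages link to which (the links dictionary is invented for the example), and it simply counts; real engines are likely to weigh links in more sophisticated ways.

```python
from collections import Counter


def link_popularity(links):
    """Count, for each page, how many *other* pages link to it.

    `links` maps a page to the list of pages it links to.
    """
    inbound = Counter()
    for source, targets in links.items():
        for target in set(targets):  # count each linking page only once
            if target != source:     # a page linking to itself doesn't count
                inbound[target] += 1
    return inbound


# Hypothetical link structure between three made-up pages.
links = {
    "/a": ["/c"],
    "/b": ["/c", "/a"],
    "/c": ["/c"],
}
popularity = link_popularity(links)
```

In this made-up example "/c" has two inbound links, "/a" has one and "/b" has none, so "/c" would be treated as the most 'popular' page.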

Based on that understanding of how a search engine works, many people believe that simply peppering a page with recurring instances of the same word will get you great results easily. It doesn't - this behaviour has the opposite effect. Search engines are getting wise to such underhand tricks and 'downrank' pages which use them. You can find out more about this in our associated briefing: search engine tricks and why not to use them.

Okay, that's what happens with one word - but what about multiple words, for example, when you enter a phrase, such as 'golf balls'? The same rules apply, but now the search engine is working harder. It doesn't just look for 'golf balls' as two adjacent words (though it does do this). It examines its database for pages with instances of 'golf' and 'balls', and ranks higher those pages which have the most occurrences of 'golf' and 'balls' in close proximity. (Examples of text that would match include 'golf courses in the UK with balls for hire' and 'golf caddies, clubs and balls'.) It also takes into account all the other factors which I've previously mentioned. When you consider how much work they have to do, it's amazing how fast most search engines can return decent results.
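To make 'close proximity' concrete, here is a small Python sketch that measures how many words apart two search words are in a piece of text. The function names are invented for this example, and using the smallest gap as the measure is an assumption; it is one plausible way to compare pages, not a description of any real engine.

```python
import re


def tokens(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def min_distance(words, first, second):
    """Smallest gap, in words, between `first` and `second`; None if either is missing."""
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    if not pos_first or not pos_second:
        return None
    return min(abs(i - j) for i in pos_first for j in pos_second)


texts = [
    "golf courses in the UK with balls for hire",
    "golf balls",
    "golf caddies, clubs and balls",
]
# Closest pairing first: the smaller the gap, the better the match.
by_proximity = sorted(texts, key=lambda t: min_distance(tokens(t), "golf", "balls"))
```

Sorted this way, 'golf balls' (adjacent words) comes first, then 'golf caddies, clubs and balls', then 'golf courses in the UK with balls for hire'.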

You can see why search engines don't search only for directly adjacent words by typing in a perfectly reasonable search phrase: 'low price golf balls UK'. The chances of these exact words being on a page, in that order, are pretty low; it's more likely that they'd be in a sentence such as 'the lowest price golf balls in the UK'. This sentence isn't an exact match, but it would still rank highly, because it contains most of the key words, close together. Also, a search engine can draw information from the meta data on the page, so if 'the lowest price golf balls' appeared on the page, and 'The Golf Shop - the best place for golf balls in the UK' appeared in the hidden meta description tag, then the page would still rank well - as all of the words are likely to be found.
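The example above can be tried out with a short Python sketch that checks what share of the search words turn up in the page text and the meta description combined. The function name is invented, and the matching is deliberately crude: it only counts exact words, so 'low' does not match 'lowest' here, whereas a real search engine may well treat such closely related words as a match.

```python
def matched_fraction(query, body, meta_description=""):
    """Share of the query words found in the page body or meta description."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    page_words = set((body + " " + meta_description).lower().split())
    return len(query_words & page_words) / len(query_words)


query = "low price golf balls UK"
body = "the lowest price golf balls"
meta = "The Golf Shop - the best place for golf balls in the UK"

body_only = matched_fraction(query, body)        # 'price', 'golf', 'balls'
with_meta = matched_fraction(query, body, meta)  # the meta description adds 'uk'
```

With the body text alone, three of the five search words are found; adding the meta description brings in 'UK' and lifts that to four out of five.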

If only there were one standard

All search engines work differently. This is why pages will be ranked differently on different search engines. Search engines read, and value, different things - and then rank them differently. This makes refining your pages for search engines something of a pain, because you're not going to please all of them. The good news is that most people use just a few search engines - so a strategy to optimise your pages for just those top engines will give you better results than trying to optimise them for all search engines.

Conclusion

Search engines do a complex job, with lots of data - and they do it quickly. Search engines want to index your site (they wouldn't be offering a useful service if they made a bad job of indexing). But to get ranked highly, you need to design and write your Web pages in a way that is sympathetic to the way that search engines work. This can bring very effective results reasonably quickly, though there are no guarantees of top results, simply because you're just one site in millions.

 

This article © Copyright Labrow Marketing. Please do not reproduce this article without permission. If you wish to reproduce this article please contact us.