Things which search engines don't like

Just as you want to do your best to optimise pages for search engines, so it's wise to avoid things that they don't like. Technically, search engines lag behind the times and prefer simple pages, and there are some Web page elements which cause them problems and should be avoided. These elements are either easy to avoid or have pragmatic workarounds which are easy to implement and which dampen their negative effects.

Spamdexing

Top of the list of things that search engines don't like is 'spamdexing' - various tricks which some people use to get themselves higher in search engine rankings. Search engines are now wise to these tricks and downrank (or even refuse to index) sites which use them. These tricks are covered in detail in our associated briefing: 'search engine tricks - and why they are not worth using'. It's worth reading this briefing to ensure that you are not accidentally spamdexing, or if you've been persuaded that some of these tricks work and are not aware of their negative effects.

Flash

Unlike many Web consultants, I don't have a downer on Flash. It's fun, funky and cool, provided that it's used well. Search engines do have a downer on Flash. A Flash site is an invisible site as far as search engines are concerned. Search engines can't currently spider the text within a Flash site. Since the Flash API is now 'open', and almost all Web browsers can now view Flash sites, I'm sure that it's only a matter of time before they can. (Top search engines like Google can now index PDF files directly, and then display these on-the-fly as HTML, so it can't be long before the same happens for Flash files.) For now, though, text in Flash sites is invisible to search engines.

A Flash designer will tell you that it doesn't matter that search engines can't index Flash, because the Flash can sit within an HTML page that can contain hidden meta tags. Ouch! Search engines now like to index the content of the page, and few pay much attention to meta tags (and most ignore the meta keywords tag which this strategy demands).

Another popular use of Flash is on splash screens - an initial page which greets the visitor with some kind of animation. These pages are also not great for search engines, for the same reason. Search engines count the index or home page as one of the most important, and this sets the context for subsequent indexing. So what often gets indexed in these cases is a few irrelevant words, such as 'click here to skip intro'.

Some sites work really well in Flash, but most could combine Flash with HTML - at least for key pages - to get the best of both worlds. There's usually little creative compromise in this. Only go for 'all Flash' if you don't care if your site never gets listed on search engines.

Frames

Some Web designers (usually newbies) love frames because they make managing Web site navigation easier. When you're new to Web design, they seem cool technically. Most experienced Web designers and usability experts believe that frames suck - for several reasons. What we're most concerned with in this briefing is that most search engines are totally flummoxed by Web sites which use frames.

Frames present more than one HTML page, seamlessly stitched together and presented as a single page. A common use of frames is to have navigation at the left in one frame (which remains static) and content in the main portion of the screen (which can scroll). In this situation, the 'page' actually consists of three HTML pages - the left page, the content page and the page which 'holds' the frameset itself. Sites can have many more pages built into a frameset - another common layout is navigation at the left and top, and content in the main portion of the screen - which takes four HTML pages.

This is great when you browse to the site and everything is sitting correctly within its frameset. But a search engine just sees four pages. It doesn't know which is which. It is more likely to index the content page and rank it higher (usually the navigation pages contain graphics, not text, and even if they contain text it will be navigation/hyperlinks, of little interest to the search engine). So it indexes the content page - which it will present to a person who is searching. This person visits the site and sees only the content page. No navigation and no way of finding it.

There are workarounds - such as using the <NOFRAMES> tag to present a separate page to search engines. This is a poor solution, as we really want to get our entire site indexed, not just one page.

Also, because using frames multiplies the number of HTML pages, it dilutes the density of keywords across your site by the same factor, making the keywords that the search engine can see correspondingly less relevant.

The biggest advantage of frames is that you can create one page for navigation and use it on all pages in your site, so when you make changes, the site is easier to manage. (Technically you can achieve pretty much the same thing with SSI (server side include) files or tools within your Web design tool, like Dreamweaver's library items.) But the downside of frames is so huge, so enormous, that they are simply not worth using. They confuse the heck out of search engines and trash your chances of a high ranking. The best workaround: forget frames.

Frames also suck from a usability perspective, because they break the 'page metaphor' of the Web, making it difficult or even impossible to link properly to specific pages.

Graphics

Search engines can't read graphics. It's common practice to use them for navigation and even headings, and, given the sorely design-limited nature of the Web, it's easy to see why people do this (heck, this site uses graphics for the top menus, too, though not for headings).

Don't stop using graphics (how dull would your site be then?) - but use them sensibly. Don't use them for anything you want to have indexed. Alt text (alternative text) is supported by some search engines, but it never ranks as high as page content.
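
If you do rely on alt text for some of your graphics, it's easy to check that none have been missed. What follows is only a rough sketch in Python, using the standard library's HTML parser; the function name and the sample file names are made up for illustration.

```python
from html.parser import HTMLParser


class AltAudit(HTMLParser):
    """Collect the src of every <img> that has no usable alt text."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Treat a missing, empty or whitespace-only alt as "no alt text".
        if not (attributes.get("alt") or "").strip():
            self.missing.append(attributes.get("src") or "")


def images_missing_alt(html: str) -> list[str]:
    """Hypothetical helper: return the images in html that lack alt text."""
    audit = AltAudit()
    audit.feed(html)
    return audit.missing


# Sample page fragment (invented for this sketch).
SAMPLE = '<img src="menu.gif" alt="Home"><img src="logo.gif">'
print(images_missing_alt(SAMPLE))
```

Remember that this only tells you where alt text is absent. It doesn't change the fact that alt text never ranks as high as page content.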

PDFs

There's some good news and some bad when it comes to PDF files. Top search engines can now read and index PDF files stored on Web sites - and even display their contents as HTML to people who don't have Acrobat Reader. This is very useful and a result of Adobe making the PDF file format an 'open' (or published) standard. From a usability perspective, Web sites which use PDF files instead of pages, or don't offer a warning that links lead to a PDF file, tend to frustrate visitors. It takes a lot longer to load up a PDF page than a Web page, so people tend to be unsure about what's happening while a big file is loading. What we're worried about here is search engine results, though, so if PDFs can be indexed, that's all fine and dandy, surely?

One of the main things to be wary of is whether you actually want your PDFs to be indexed. Many PDFs are white papers or contain more detailed information, with a mechanism for visitors to register to get them or perhaps reach them via a defined path. There are quite a few reasons why you might consider that the content of your PDFs should not be indexed. In these cases, the answer is simple. Put them in a common folder, and use a ROBOTS.TXT file to stop search engines indexing the content of that folder.

The ROBOTS.TXT file is a plain text file, placed at the root of your Web site. A sample ROBOTS.TXT file to halt indexing on PDFs stored in a PDF folder would look like this:

User-agent: *
Disallow: /pdf/

Before using the ROBOTS.TXT file, you can find out more about it, and how you can also use meta tags to halt indexing in our associated briefing on meta tags.
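
If you want to check what a ROBOTS.TXT file actually blocks before you upload it, Python's standard library includes a parser for the format. The sketch below feeds it the two sample rules shown above; the example.com addresses are placeholders, not real pages.

```python
from urllib.robotparser import RobotFileParser

# The two rules from the sample ROBOTS.TXT file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /pdf/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the rules let the given user agent fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder URLs: one inside the PDF folder, one outside it.
for url in (
    "http://www.example.com/pdf/whitepaper.pdf",
    "http://www.example.com/briefings/index.htm",
):
    print(url, "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked")
```

Bear in mind that this only shows how a well-behaved spider should read the file. ROBOTS.TXT is a request, not a lock.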

Cookies

Cookies are small text files (usually harmless) which are passed to the browser, usually to make using a site easier. For example, cookies can be used to recognise who you are, so that any choices you made the last time you visited a site are shown when you go back again. Lots of sites use cookies and they are nothing to be worried about. Where they cause a problem for search engines is when a site is designed in such a way that the browser has to accept a cookie before the site (or portions of it) can be accessed. The workaround is to take care when using cookies and to ensure that they don't have to be accepted. (It is also possible to use a script to recognise when a visitor is a search engine, so that a cookie is not served.)
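
As a rough idea of what such a script might do, here is a small Python sketch which looks for a spider's name in the User-Agent string sent with each request. The list of names is illustrative and far from complete, and the function names are invented for this example. A real site would need to keep such a list up to date.

```python
# Illustrative, non-exhaustive list of spider names (an assumption of this
# sketch); real spiders vary and new ones appear regularly.
CRAWLER_TOKENS = ("googlebot", "bingbot", "slurp")


def is_search_engine(user_agent: str) -> bool:
    """Guess whether a User-Agent string belongs to a search engine spider."""
    lowered = user_agent.lower()
    return any(token in lowered for token in CRAWLER_TOKENS)


def requires_cookie(user_agent: str) -> bool:
    """Hypothetical policy: only insist on a cookie for ordinary browsers."""
    return not is_search_engine(user_agent)


print(requires_cookie("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
print(requires_cookie("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))
```

The safer option is still the first one: design the site so that nobody, person or spider, has to accept a cookie to read it.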

Passwords

It's obvious really. If you need a password to get beyond a certain point on your site, search engines can't get beyond that point either - so password-protected portions of a Web site can't get indexed. This is almost always a good thing (and the behaviour you intended), but sometimes information which is intended for broader consumption sits in a password-protected area. Solution - move it out into the open if you want it indexed.

Stop words

To make processing simpler and conserve resources, many search engines don't index common words, or words that have little value. There's no definitive list of these, but they include 'the,' 'a,' 'about,' 'also,' 'and,' 'if,' 'is,' 'that' and so on. So you need to ensure that stop words are kept to a minimum in the most heavily indexed parts of the page, especially at the start of a sentence, paragraph or text block (because, to be frank, there's no way to construct many sentences without stop words). So don't start a sentence with 'the best...' for example. Keep stop words out of your page title, URL and meta description if you can.
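
A quick way to spot the worst offenders is to check whether a title or opening sentence begins with stop words. The Python sketch below does this using only the handful of stop words listed above. Since there's no definitive list, treat the list (and the function name) as assumptions for illustration.

```python
# Only the examples given in the text; there is no definitive list.
STOP_WORDS = {"the", "a", "about", "also", "and", "if", "is", "that"}


def leading_stop_words(text: str) -> list[str]:
    """Return the run of stop words found at the very start of text."""
    found = []
    for word in text.lower().split():
        word = word.strip(".,;:!?'\"")  # ignore surrounding punctuation
        if word in STOP_WORDS:
            found.append(word)
        else:
            break  # stop at the first word that carries meaning
    return found


print(leading_stop_words("The best golf balls"))
print(leading_stop_words("Golf balls and accessories"))
```

An empty result means the text starts with a word that a search engine is likely to index.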

Dynamic URLs

Web sites with dynamic content have been something of a search engine headache. Dynamic content is (as the name suggests) generated on the fly, from a database, and presented to the browser as HTML (as compared to pages of static content which are pure 'flat' HTML). There are many advantages to using dynamic content - especially for larger sites. In fact, many larger sites would be almost impossible to manage if the content was not dynamic.

Dynamic content can be pulled from product databases, ordering systems, stock systems, sales systems, systems designed simply for people to share the task of managing content - or even all of these, on the same Web site. This means that vast amounts of continually changing content can always be displayed in its most up-to-date form - which is why it is called 'dynamic content'. Sounds great - so what's the problem with it as far as search engines are concerned?

Because dynamic content is generated on the fly, the URLs for each page are also dynamic. Compare these URLs:

Static URL:
http://www.labrow.com/briefings/searchengines/disadvantaged.htm

Dynamic URL:
http://www.labrow.com/briefings/briefings.asp?type=searchengines&topic=disadvantaged

They're both long URLs (since the target document is stored within a logical site structure) but the length of the URL isn't the issue - it's the ASCII characters "?" and "=" which cause some search engines to halt.
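
To see the difference the way a program sees it, the short Python sketch below splits each of the two URLs above and reports whether it carries a query string (the part after the "?") and which name=value pairs it contains. The function name is made up for this example.

```python
from urllib.parse import parse_qsl, urlsplit


def describe_url(url: str) -> dict:
    """Report whether a URL has a query string and list its parameters."""
    parts = urlsplit(url)
    return {
        "path": parts.path,
        "dynamic": bool(parts.query),  # True when something follows the "?"
        "parameters": parse_qsl(parts.query),
    }


STATIC_URL = "http://www.labrow.com/briefings/searchengines/disadvantaged.htm"
DYNAMIC_URL = (
    "http://www.labrow.com/briefings/briefings.asp"
    "?type=searchengines&topic=disadvantaged"
)

print(describe_url(STATIC_URL))
print(describe_url(DYNAMIC_URL))
```

A check like this, run over a list of your own pages, gives a quick picture of how much of the site sits behind dynamic URLs.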

Most search engine spiders use very primitive technology, the equivalent of browsing to a Web site with a 10-year-old Web browser. They don't have to read lots of the new additions to HTML, since they have one job: to index the text. These types of search engines can't cope with the special characters used in dynamic URLs. Those other search engines which are smarter know that these URLs can potentially lead them into a huge processing loop, because there might be hundreds or thousands of pages behind each URL string.

So, dynamic Web sites using most (though not all) content-management technologies tend to be disadvantaged on some search engines.

This is a pain because dynamic content-management technology is really the only option for many Web sites and workarounds are painful. (In the above example, it is possible, on the Web server, to run an application which filters out the special characters and presents a cleaner URL. Also, there are technologies which read a dynamic site and automatically transform it into a static one, but you obviously lose the key benefit of a dynamic site - that the information is always up-to-date.)
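
The idea behind a URL filter can be sketched in a few lines of Python. The function below takes a clean, static-looking path and works out the dynamic URL which the site would really serve, using the two example URLs from earlier. The path layout and the function name are assumptions made for illustration. In practice this job is normally done by the Web server's own URL rewriting facility rather than by hand-written code.

```python
from urllib.parse import urlencode


def to_dynamic_url(clean_path: str) -> str:
    """Map /briefings/<type>/<topic>.htm to the briefings.asp query form.

    The three-part path layout is an assumption based on the example URLs.
    """
    segments = clean_path.strip("/").split("/")
    if len(segments) != 3 or segments[0] != "briefings":
        raise ValueError(f"unrecognised path: {clean_path}")
    topic = segments[2].rsplit(".", 1)[0]  # drop the .htm extension
    query = urlencode({"type": segments[1], "topic": topic})
    return f"/briefings/briefings.asp?{query}"


print(to_dynamic_url("/briefings/searchengines/disadvantaged.htm"))
```

The spider only ever sees the clean path, while the content-management system still receives the parameters it expects.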

The good news is that the leading search engines are improving, and some can now read dynamic URLs better than they could. My tests with Google indicate that it can, and does, read and index Web sites with less complex dynamic URLs. Some search engines can only spider dynamic pages if each dynamic URL is submitted, and this was once true of Google. Since dynamic Web sites make up a significant proportion of the Web, the indexing of them can only get better.

It's also not too hard to include quite a few pages within your dynamic site which don't have such complexly formed URLs. A good example is the main page of a product folder, which could be named simply 'index.asp'. With a little care, lots of pages could have a simple name. These pages can be better optimised for search engines, as entry points to various sections of your site.

While it's hardly sensible to avoid using content-management tools when they provide the ideal solution, it's not difficult to create a pragmatic workaround to keep key pages of your Web site well-tuned for search engines. These changes won't compromise your site or be complicated to implement.

Broken links

People can't follow broken links (hyperlinks which don't work, perhaps because a page has been moved or renamed) and neither can search engines. All links on your site should be checked thoroughly. (If you use a tool like Dreamweaver, it can do it for you automatically.)
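
If you don't have a tool which checks links for you, the principle is simple enough to sketch. The Python example below pulls the links out of a page and compares them against a list of pages known to exist. It is deliberately simplified: it assumes plain relative links and a list of pages you supply yourself, and the file names are invented.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def broken_links(html: str, known_pages: set[str]) -> list[str]:
    """Return links in html which do not match any known page."""
    collector = LinkCollector()
    collector.feed(html)
    return [link for link in collector.links if link not in known_pages]


# Invented sample: one good link and one pointing at a renamed page.
PAGE = '<a href="about.htm">About</a> <a href="old-contact.htm">Contact</a>'
print(broken_links(PAGE, {"about.htm", "contact.htm"}))
```

A proper link checker would also request each external address to see whether it still responds, which this sketch does not attempt.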

Quickly refreshing pages

Sometimes URLs redirect the browser to another page automatically - with or without a time delay. Perhaps there is a 'welcome' page which you want people to see only briefly. Perhaps you have promoted an easy to remember URL in literature (such as www.golfballsandmoregolfballs.com/nike) which redirects the browser to the real page (such as www.golfballsandmoregolfballs.com/catalogue/golfballtype.asp?type=nike&offer=yes which is just too darned complex to promote).

What you need to bear in mind is that search engines distrust pages which refresh automatically and quickly, especially where the refresh is set to '0' so the page is never seen. Spamdexers load such pages with irrelevant content and keywords in the hope that search engines will index them, bringing visitors who never see the page. This is why search engines now tend to ignore such pages. This is a shame, because they have a useful function - but it is a fact which you need to be aware of when using them. Best not to, or only use them where it's not a worry if subsequent indexing is compromised.
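
To find out whether any of your own pages fall into this category, you can look for the meta refresh tag, which is the usual way of setting up such a redirect in HTML. This Python sketch reports the delay, in seconds, of each refresh it finds. The function name and sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Record the delay of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.delays = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # The content value looks like "0; url=/somewhere": delay comes first.
        content = attributes.get("content") or ""
        delay = content.split(";", 1)[0].strip()
        try:
            self.delays.append(float(delay))
        except ValueError:
            pass  # ignore malformed values


def refresh_delays(html: str) -> list[float]:
    """Return the refresh delays (in seconds) found in html."""
    finder = RefreshFinder()
    finder.feed(html)
    return finder.delays


SAMPLE = '<meta http-equiv="refresh" content="0; url=/catalogue/">'
print(refresh_delays(SAMPLE))
```

Any page which reports a delay of zero, or close to it, is a candidate for the kind of distrust described above.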

Navigation image maps

Image maps are convenient and often cool for navigation. Normally, image-based navigation will contain one graphic per navigation item (home, about, contact and so on). An image map uses one graphic for everything - and the designer maps out zones over the parts of the image which contain navigation words. There are pros and cons to both approaches, but rarely is using an image map inescapable (though it can be quicker to download than several images, which is why they are often used). Some search engines can't follow the links within image maps, which means that not all of your site will be indexed - by some engines. All search engines can follow text links and links associated with single images. There are two workarounds. The obvious one is to avoid image maps where possible. The other is to use text links at the foot of the page, in addition to your image map and the main menu links. Search engines will find and follow these. (It's good practice to do this anyway for several reasons: people expect them there when they scroll down, and they are useful for people who browse with images turned off - yes, quite a few people do.)
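
It's easy to add an image map and forget one of the matching text links. As a rough check, the Python sketch below lists any image map destinations which have no ordinary link pointing to the same place on the page. The function name and sample markup are invented for this example.

```python
from html.parser import HTMLParser


class NavLinkAudit(HTMLParser):
    """Separate image-map links (<area>) from ordinary links (<a>)."""

    def __init__(self):
        super().__init__()
        self.area_links = set()
        self.anchor_links = set()

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if not href:
            return
        if tag == "area":
            self.area_links.add(href)
        elif tag == "a":
            self.anchor_links.add(href)


def links_missing_text_equivalent(html: str) -> list[str]:
    """Return image-map destinations with no matching ordinary link."""
    audit = NavLinkAudit()
    audit.feed(html)
    return sorted(audit.area_links - audit.anchor_links)


# Invented sample: two map zones but only one footer link.
SAMPLE = """
<map name="nav">
  <area shape="rect" coords="0,0,50,20" href="home.htm">
  <area shape="rect" coords="50,0,100,20" href="about.htm">
</map>
<p><a href="home.htm">Home</a></p>
"""
print(links_missing_text_equivalent(SAMPLE))
```

Anything the check reports is a page which some search engines may never reach.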

Conclusion

There are lots of things which search engines, being conservative beasts, don't like. I'm not aware of any that can't be avoided or worked around. Employing a workaround will improve your search engine results, often with little compromise to your site. Of course, if you want a Flash-only site (for example), then that's fine, provided you don't have high aspirations for people finding it. Time will see search engines getting better at handling some of these issues - great progress has been made with PDF files and dynamic URLs, for example. But don't count on all of them being fixed - frames have baffled search engines for years, and that's hardly new technology.

 

This article © Copyright Labrow Marketing. Please do not reproduce this article without permission. If you wish to reproduce this article please contact us.