How to Control Search Engine Robots

Published: 17th May 2005

Wouldn't it be nice to be able to leave some code in your web site to tell

the search engine spider crawlers to make your site number one? Unfortunately a

robots.txt file or robots meta tag won't do that, but they can help the crawlers

to index your site better and block out the unwanted ones.



First, a few definitions:



Search Engine Spiders or Crawlers - A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.



A web crawler is one type of bot, or software agent. In general, it starts

with a list of URLs to visit. As it visits these URLs, it identifies all the

hyperlinks in the page and adds them to the list of URLs to visit, recursively

browsing the Web according to a set of policies.
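
The crawl loop described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the FAKE_WEB dictionary, the example.com URLs, and the crawl and fake_get_links names are all made up for the example, and the "policy" is simply breadth-first order with a page limit.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
FAKE_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def fake_get_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return FAKE_WEB.get(url, [])


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the list of seed URLs."""
    to_visit = deque(seeds)   # the list of URLs still to visit
    seen = set(seeds)         # URLs already queued, so none is visited twice
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


order = crawl(["http://example.com/"], fake_get_links)
print(order)
```

A real crawler would replace fake_get_links with code that fetches each page over the network, and would check robots.txt before each fetch.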



Robots.txt - The robots exclusion standard, or robots.txt protocol, is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.



The robots.txt protocol is purely advisory, and relies on the cooperation of

the web robot, so that marking an area of your site out of bounds with

robots.txt does not guarantee privacy. Many web site administrators have been

caught out trying to use the robots file to make private parts of a website

invisible to the rest of the world. However the file is necessarily publicly

available and is easily checked by anyone with a web browser.



The robots.txt patterns are matched by simple prefix comparisons against the start of the URL path, so care should be taken to make sure that patterns matching directories have the final '/' character appended: otherwise all files with names starting with that substring will match, rather than just those in the directory intended.
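
To see why the trailing '/' matters, here is a minimal sketch of the matching rule, assuming plain prefix comparison as described above. The is_disallowed function and the /private paths are hypothetical names chosen for the illustration.

```python
def is_disallowed(path, disallow_patterns):
    """Return True if the URL path starts with any non-empty Disallow pattern."""
    return any(path.startswith(pattern) for pattern in disallow_patterns if pattern)


# With the trailing slash, only files inside the directory match.
inside_dir = is_disallowed("/private/report.html", ["/private/"])
similar_name = is_disallowed("/private-offers.html", ["/private/"])

# Without the trailing slash, the similarly named file is blocked too.
similar_name_no_slash = is_disallowed("/private-offers.html", ["/private"])

print(inside_dir, similar_name, similar_name_no_slash)
```

The pattern /private blocks /private-offers.html as well as the directory, which is usually not what was intended.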



Meta Tag - Meta tags are used to provide structured data about

data.



In the early 2000s, search engines veered away from reliance on Meta tags, as

many web sites used inappropriate keywords, or were keyword stuffing to obtain

any and all traffic possible.



Some search engines, however, still take Meta tags into some consideration

when delivering results. In recent years, search engines have become smarter,

penalizing websites that are cheating (by repeating the same keyword several

times to get a boost in the search ranking). Instead of going up rankings, these

websites will go down in rankings or, on some search engines, will be kicked off

of the search engine completely.



Index a site - The act of crawling your site and gathering information.

How can the robots.txt file and meta tag help you?



In the robots.txt file you can tell unwanted web crawlers to leave your web site alone, and give helpful hints to the ones you do want to crawl your site. Here is an example of how to stop a particular web crawler from crawling your site:



# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /



ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The # allows you to write comments to yourself so you can keep track of what you typed.



Type the three lines above into a plain-text editor such as Notepad and save the file to the root directory of your web site as robots.txt. Well-behaved web crawlers look for this file first at a web site before doing anything else. This helps the crawler do its job, and helps the web site owner tell the spider what to do.
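
Because the file always lives in the root directory, a crawler can work out where to look from any page address on the site. The sketch below shows one way to do that with Python's standard urllib.parse module; the robots_txt_url function name and the www.example.com address are made up for the illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Build the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("http://www.example.com/articles/page.html?id=7")
print(location)
```

A robots.txt file saved in a subdirectory, such as /articles/robots.txt, would never be requested under this convention.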

If you would rather have the robots.txt file created for you, visit www.rietta.com/robogen. To validate your robots.txt file and make sure it works properly, you can visit www.searchengineworld.com/cgi-bin/robotcheck.cgi

Say, for instance, you have some content that you don't want the crawlers to see (for example, duplicate content kept for other browser or referrer pages). You can deter crawlers from indexing the 'duplicate' directory by typing this into your robots.txt file:

User-agent: *
Disallow: /duplicate/



The * after User-agent says that this record applies to all crawlers, and /duplicate/ after Disallow tells them to ignore this directory and not crawl it. Each record (a User-agent line followed by its Disallow lines) must be separated from the next record by a blank line in order to function correctly. So this is how you would combine the above two records in a robots.txt file:



# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
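
If you want to check rules like these yourself, Python's standard library includes a robots.txt parser, urllib.robotparser. The sketch below feeds it the combined file from above and asks what two crawlers may fetch. The www.example.com addresses and the SomeOtherBot name are placeholders, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt from the article, with a blank line between records.
ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# ia_archiver is shut out of the whole site.
archiver_home = parser.can_fetch("ia_archiver", "http://www.example.com/index.html")

# Any other crawler may fetch everything except the /duplicate/ directory.
other_home = parser.can_fetch("SomeOtherBot", "http://www.example.com/index.html")
other_dup = parser.can_fetch("SomeOtherBot", "http://www.example.com/duplicate/page.html")

print(archiver_home, other_home, other_dup)
```

Passing the text with parse() keeps the example offline; to test a live site you would instead call set_url() with the robots.txt address and then read().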



One thing to note that is very important: anyone can access the robots.txt file of a site. So if you have directories that you don't want anyone to know about, don't list them in the robots.txt file. If a directory is not linked to from your web site or from anywhere else, the crawlers will generally not find it anyway.



An alternative to blocking indexing with robots.txt is to put a robots meta tag into the page itself. It looks like this:

<meta name="robots" content="noindex,nofollow">

You put this into the <head> section of your web page. This line tells the robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. As another example,

<meta name="robots" content="noindex,follow">

tells the crawlers not to index the page, but to follow the hyperlinks on it.
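
To show how a crawler might read such a tag, here is a small sketch using Python's standard html.parser module. It assumes the usual form of the tag, a meta element with name="robots" and a comma-separated content attribute; the RobotsMetaParser class, the robots_directives function, and the sample page are made up for the illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag on a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list such as "noindex,follow".
        for token in attr_map.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = (
    '<html><head><meta name="robots" content="noindex,follow"></head>'
    "<body><a href='/next.html'>next</a></body></html>"
)

directives = robots_directives(PAGE)
may_index = "noindex" not in directives    # False for this page
may_follow = "nofollow" not in directives  # True for this page
print(directives, may_index, may_follow)
```

A page with no robots meta tag yields an empty set, which this sketch treats as permission to index and follow.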



Did you know that Google has its own tag?

It looks like this:

<meta name="googlebot" content="noindex,nofollow,noarchive">

This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store a cached version of the page. You may want the noarchive part if you update the content on your site frequently: it prevents web users from seeing outdated content served from the cache.



You can use this tag to address Google's robots specifically, either to avoid complications or because you are optimizing your site for Google's search engine.

This concludes this month's article.



Until the next article have a great day!



Copyright © Michael Rock


(You have permission to copy this article as long as it remains intact with the

author's byline)




Web development contractor (Web Design and Hosting)



Internet Presence


www.TheInternetPresence.com





The owner of this registered company has over twenty years' experience with DOS, Windows business applications, numerous programming languages, artistic development, and web design. Other areas of interest include web marketing, web promotion, and business marketing and development. Persuaded by those praising his work, he decided to go into business for himself and highly recommends that everyone else do the same.





Internet Presence was founded in 2003 from a desire to become independent. Less than one year later, Internet Presence had accounts in three different states, ranging from a locally owned auto collision repair shop to a glass packaging business that sells its product worldwide.



This article is free for republishing
Source: http://michaelrock.articlealley.com/how-to-control-search-engine-robots-1786.html

