<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DataMouse.biz Blog &#187; Crawl</title>
	<atom:link href="http://www.datamouse.biz/blog/tag/crawl/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.datamouse.biz/blog</link>
	<description>Just another WordPress weblog</description>
	<lastBuildDate>Thu, 25 Jun 2009 12:53:07 +0000</lastBuildDate>
	
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Working with Robots.txt</title>
		<link>http://www.datamouse.biz/blog/2008/10/working-with-robotstxt/</link>
		<comments>http://www.datamouse.biz/blog/2008/10/working-with-robotstxt/#comments</comments>
		<pubDate>Wed, 15 Oct 2008 18:10:36 +0000</pubDate>
		<dc:creator>DataMouse</dc:creator>
				<category><![CDATA[SEO]]></category>
		<category><![CDATA[Web Design]]></category>
		<category><![CDATA[Crawl]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Robots]]></category>

		<guid isPermaLink="false">http://www.datamouse.biz/blog/wordpress/working-with-robotstxt/65/</guid>
		<description><![CDATA[The robots.txt file is not just a way to control what search engines can and can't see on your web site. It has loads of other really important functions too.
Check out this comprehensive article on how to work with robots.txt.]]></description>
			<content:encoded><![CDATA[<p><strong>What is the robots.txt file?</strong><br />
The robots.txt file is an ASCII text file containing instructions that tell search engine robots which content they are not allowed to index. These instructions determine how a search engine indexes your website&#8217;s pages. The universal address of the robots.txt file is www.domain.com/robots.txt. This is the first file that a robot visits; it picks up the instructions for indexing the site content and follows them. Each record in the file uses two text fields. Let&#8217;s study this example:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow:</em></p>
<p>The User-agent field names the robot to which the access policy in the Disallow field applies. The Disallow field specifies the URLs that the named robots may not access; an empty Disallow value, as above, blocks nothing. An example:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /</em></p>
<p>Here &#8220;*&#8221; means all robots and &#8220;/&#8221; means all URLs. This is read as &#8220;No access for any search engine to any URL.&#8221; Since every URL path begins with &#8220;/&#8221;, a &#8220;/&#8221; with nothing after it bans access to all URLs. If only part of the site is to be blocked, only the banned URL is specified in the Disallow field. Let&#8217;s consider this example:</p>
<p><em># Research access for Googlebot.</em></p>
<p><em>User-agent: Googlebot</em></p>
<p><em>Disallow:</em></p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /concepts/new/</em></p>
<p>Here we see that both fields have been repeated. Separate records can be given for different user agents on different lines. The records above mean that all user agents are denied access to /concepts/new/ except Googlebot, which has full access. Characters following # are ignored up to the end of the line, as they are treated as comments.</p>
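<p>The example above can be checked with a short Python sketch. It uses the standard library&#8217;s urllib.robotparser module, which is only one parser&#8217;s reading of the rules and not a statement about how any particular search engine behaves. The bot name SomeOtherBot and the test URL on www.domain.com are made up for illustration.</p>

```python
from urllib.robotparser import RobotFileParser

# The example record set from the text above.
ROBOTS_TXT = """\
# Research access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/
"""


def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URL on the placeholder domain used in this article.
url = "http://www.domain.com/concepts/new/index.html"

googlebot_allowed = parser.can_fetch("Googlebot", url)
other_allowed = parser.can_fetch("SomeOtherBot", url)
print(googlebot_allowed, other_allowed)
```

<p>Under these assumptions the script should print True for Googlebot and False for the other robot, which matches the reading given above.</p>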
<p><strong>Working with the robots.txt file</strong></p>
<p>1. The robots.txt file is always named in all lowercase (e.g. Robots.txt or robots.Txt is incorrect)</p>
<p>2. Wildcards are not supported in either field. Only * can be used, in the User-agent field, because it is a special character denoting &#8220;all&#8221;. Googlebot is the only robot that now supports some wildcards, such as matching by file extension.<br />
Ref: http://www.google.com/support/webmasters/bin/topic.py?topic=8475</p>
<p>3. The robots.txt file is an exclusion file meant for search engine robot reference and not obligatory for a website to function. An empty or absent file simply means that all robots are welcome to index any part of the website.</p>
<p>4. Only one robots.txt file can be maintained per domain.</p>
<p>5. Website owners who do not have administrative rights sometimes cannot create a robots.txt file. In such situations, the Robots Meta Tag can be configured to serve the same purpose. Keep in mind, however, that questions have been raised lately about how robots treat the Robots Meta Tag, and some robots might skip it altogether. The protocol obliges all robots to start with robots.txt, which makes that file the default starting point for all robots.</p>
<p>6. Each user agent needs its own line, and a Disallow line should not carry more than one path. There is no limit to the number of lines, though: both the User-agent and Disallow fields can be repeated with different values any number of times. Do not put blank lines inside a single record, because a blank line marks the end of the record.</p>
<p>7. Use lower case for robots.txt content wherever you can. The field names are not case-sensitive, but paths are: filenames on Unix systems are case-sensitive, so write directory and file names in exactly the case used on a Unix-hosted domain.</p>
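<p>The point about case sensitivity can be illustrated with a minimal Python sketch, again using the standard library&#8217;s urllib.robotparser as a stand-in for a real robot. The directory name /Private/ and the bot name AnyBot are hypothetical.</p>

```python
from urllib.robotparser import RobotFileParser

# Hypothetical record: the directory name is capitalized on purpose.
ROBOTS_TXT = "User-agent: *\nDisallow: /Private/\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Paths are compared exactly as written, so only the matching case is blocked.
exact_case_allowed = parser.can_fetch("AnyBot", "/Private/report.html")
lower_case_allowed = parser.can_fetch("AnyBot", "/private/report.html")
print(exact_case_allowed, lower_case_allowed)
```

<p>With this parser the capitalized path is blocked and the lower-case one is not, so a mismatch in case between robots.txt and the real directory name leaves the directory open to robots.</p>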
<p>You can check your robots.txt with the robots.txt Validator from www.searchengineworld.com.</p>
<p><strong>Advantages of the robots.txt file</strong><br />
The protocol demands that all search engine robots start with the robots.txt file. This is the default entry point for robots if the file is present. Specific instructions can be placed in this file to help index your site on the web. The major search engines honor the Standard for Robots Exclusion.</p>
<p>1. The robots.txt file can be used to keep out unwanted robots such as email harvesters, image strippers etc.</p>
<p>2. The robots.txt file can be used to specify the directories on your server that you don&#8217;t want robots to access and/or index e.g. temporary, cgi, and private/back-end directories.</p>
<p>3. An absent robots.txt file could generate a 404 error and redirect the robot to your default 404 error page. Careful research has shown that sites with no robots.txt file and a customized 404 error page serve that page to the robots instead. The robot may then treat it as the robots.txt file, which can confuse its indexing.</p>
<p>4. The robots.txt file is used to direct selected robots to the relevant pages to be indexed. This is especially handy where the site has multilingual content or where the robot is searching for only specific content.</p>
<p>5. The need for the robots.txt file was also felt to stop robots from deluging servers with rapid-fire requests or re-indexing the same files repeatedly. If you have duplicate content on your site for any reason, the same can be controlled from getting indexed. This will help you avoid any duplicate content penalties.</p>
<p><strong>Disadvantages of the robots.txt file</strong><br />
Careless handling of directory and filenames can lead hackers to snoop around your site by studying the robots.txt file, as you may sometimes list filenames and directories that have classified content. This is not a serious issue, as deploying effective security checks on the content in question can take care of it. For example, if you keep your traffic log at a URL such as www.domain.com/stats and do not want robots to index it, you would have to add a command to your robots.txt file. As an example:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /stats/</em></p>
<p>However, it is easy for a snooper to guess what you are trying to hide, and simply typing the URL www.domain.com/stats into a browser would give access to it. This calls for one of the following remedies:</p>
<p>1. Change file names:<br />
Change the stats filename from index.php to something different, such as stats-new.php, so that your stats URL now becomes www.domain.com/stats/stats-new.php</p>
<p>Place a simple text file containing the text, &#8220;Sorry you are not authorized to view this page&#8221;, and save it as index.php in your /stats/ directory.</p>
<p>This way the snooper cannot guess your actual filename and get to your banned content.</p>
<p>2. Use login passwords:<br />
Password-protect the sensitive content listed in your robots.txt file.</p>
<p><strong>Optimization of the robots.txt file</strong><br />
<strong>The Right Commands in robots.txt:</strong><br />
Use correct commands. The most common errors include putting a value meant for the &#8220;User-agent&#8221; field in the &#8220;Disallow&#8221; field and vice versa.</p>
<p>Please also note that there is no &#8220;Allow&#8221; command in the standard robots.txt protocol. Content not blocked in the &#8220;Disallow&#8221; field is considered allowed. Currently, only two fields are recognized: the &#8220;User-agent&#8221; field and the &#8220;Disallow&#8221; field. Experts are considering the addition of more robot-recognizable commands to make the robots.txt file more Webmaster and robot friendly.</p>
<p>Note: Google is the only search engine that is experimenting with certain new robots.txt commands.<br />
It recognizes the &#8220;Allow&#8221; command. Please read more details on the Google site for robots.txt usage.</p>
<p><strong>Bad Syntax:</strong><br />
Do not put multiple file URLs in one Disallow line in the robots.txt file. Use a new Disallow line for every directory that you want to block access to.</p>
<p>Incorrect robots.txt example:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /concepts/ /links/ /images/</em></p>
<p>Correct robots.txt example:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /concepts/</em></p>
<p><em>Disallow: /links/</em></p>
<p><em>Disallow: /images/</em></p>
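<p>To see why the one-line form fails, here is a small sketch that feeds both versions to Python&#8217;s urllib.robotparser. That module is used here only as an example of a parser that takes the whole Disallow value as a single path; other robots may react differently. The helper name blocked_paths, the bot name AnyBot and the sample file names are assumptions made for the illustration.</p>

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, agent="AnyBot"):
    """Return the subset of paths that the given robots.txt text blocks."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, p)]


# The incorrect and correct examples from the text above.
INCORRECT = "User-agent: *\nDisallow: /concepts/ /links/ /images/\n"
CORRECT = (
    "User-agent: *\n"
    "Disallow: /concepts/\n"
    "Disallow: /links/\n"
    "Disallow: /images/\n"
)

# Hypothetical files inside the three directories.
PATHS = ["/concepts/a.html", "/links/b.html", "/images/c.gif"]

blocked_incorrect = blocked_paths(INCORRECT, PATHS)
blocked_correct = blocked_paths(CORRECT, PATHS)
print(blocked_incorrect)
print(blocked_correct)
```

<p>With this parser the one-line version blocks none of the three sample paths, because the combined value matches no real URL, while the three-line version blocks all of them.</p>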
<p><strong>Files and Directories:</strong><br />
If a specific file has to be disallowed, end the path with the file extension and do not add a forward slash at the end. Study the following robots.txt example:</p>
<p>For file:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /hilltop.html</em></p>
<p>For Directory:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /concepts/</em></p>
<p>Remember, if you have to block access to all files in a directory, you don&#8217;t have to specify each and every file in robots.txt. You can simply block the directory as shown above. Another common error is leaving out the slashes altogether, which gives robots a very different instruction from the one intended.</p>
<p><strong>The Right Location for the robots.txt file:</strong><br />
No robot will access a badly placed robots.txt file. Make sure that the location is www.domain.com/robots.txt.</p>
<p><strong>Capitalization in robots.txt</strong><br />
Never capitalize your syntax commands beyond what the standard shows. Directory and filenames are case-sensitive on Unix platforms. The only capitals used per the standard are in &#8220;User-agent&#8221; and &#8220;Disallow&#8221;.</p>
<p><strong>Correct Order for robots.txt:</strong><br />
If you want to block access to all robots except one or a few, then the specific ones should be mentioned first. Let&#8217;s study this robots.txt example:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /</em></p>
<p><em>User-agent: MSNbot</em></p>
<p><em>Disallow:</em></p>
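<p>In the above case, a robot that stops at the first record matching it, such as MSNbot here, could simply leave the site without indexing after reading the first command. Robots that follow the standard treat the * record as a fallback wherever it appears, but listing the specific robots first is the safe choice. Correct syntax is:</p>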
<p><em>User-agent: MSNbot</em></p>
<p><em>Disallow:</em></p>
<p><em>User-agent: *</em></p>
<p><em>Disallow: /</em></p>
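<p>How much the order matters depends on the robot. As a rough check, the sketch below runs both orderings through Python&#8217;s urllib.robotparser, which keeps the * record as a fallback and looks for a named record first. This shows one parser&#8217;s behavior only; a simpler robot that reads top to bottom could still stop at the first match, which is why putting specific robots first remains the safer habit. The bot name OtherBot and the path /index.html are made up for the example.</p>

```python
from urllib.robotparser import RobotFileParser

# The two orderings discussed above.
GENERAL_FIRST = "User-agent: *\nDisallow: /\n\nUser-agent: MSNbot\nDisallow:\n"
SPECIFIC_FIRST = "User-agent: MSNbot\nDisallow:\n\nUser-agent: *\nDisallow: /\n"


def can_fetch(robots_txt, agent, path="/index.html"):
    """Ask the parser whether the agent may fetch the path under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


results = {
    (label, agent): can_fetch(text, agent)
    for label, text in (
        ("general_first", GENERAL_FIRST),
        ("specific_first", SPECIFIC_FIRST),
    )
    for agent in ("MSNbot", "OtherBot")
}

for key, allowed in sorted(results.items()):
    print(key, allowed)
```

<p>With this parser MSNbot is allowed and OtherBot is blocked in both orderings.</p>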
<p><strong>The robots.txt file:</strong><br />
Not having a robots.txt file at all could generate a 404 error for search engine robots, which could redirect the robot to the default 404-error page or your customized 404-error page. If this happens seamlessly, it is up to the robot to decide whether the target file is a robots.txt file or an HTML file. Typically it would not cause many problems, but you may not want to risk it. It&#8217;s always a better idea to put the standard robots.txt file in the root directory than to have none at all.</p>
<p>The standard robots.txt file for allowing all robots to index all pages is:</p>
<p><em>User-agent: *</em></p>
<p><em>Disallow:</em></p>
<p><strong>Using # Carefully in the robots.txt file:</strong><br />
Using &#8220;#&#8221; to add a comment on the same line as a syntax command is not a good idea. Some robots might misinterpret the line, although it is acceptable as per the robots exclusion standard. New lines are always preferred for comments.</p>
<p><strong>Using the robots.txt file</strong></p>
<p>* Robots are configured to read text. Too much graphic content could render your pages invisible to the search engine. Use the robots.txt file to block irrelevant and graphic-only content.</p>
<p>* Indiscriminate access to all files, it is believed, can dilute relevance to your site content after being indexed by robots. This could seriously affect your site&#8217;s ranking with search engines. Use the robots.txt file to direct robots to content relevant to your site&#8217;s theme by blocking the irrelevant files or directories.</p>
<p>* The robots.txt file can be used for multilingual websites to direct robots to relevant content for relevant topics for different languages. It ultimately helps the search engines to present relevant results for specific languages. It also helps the search engine in its advanced search options where language is a variable.</p>
<p>* Some robots could cause severe server loading problems by rapid firing too many requests at peak hours. This could affect your business. By excluding some robots that might be irrelevant to your site, in the robots.txt file, this problem can be taken care of. It is really not a good idea to let malevolent robots use up precious bandwidth to harvest your emails, images etc.</p>
<p>* Use the robots.txt file to block out folders with sensitive information, text content, demo areas or content yet to be approved by your editors before it goes live.</p>
<p>The robots.txt file is an effective tool to address certain issues regarding website ranking. Used in conjunction with other SEO strategies, it can significantly enhance a website&#8217;s presence on the net.</p>
<p>DM</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamouse.biz/blog/2008/10/working-with-robotstxt/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Googlebots and Leading the Blind: Dynamic Pages and SEO</title>
		<link>http://www.datamouse.biz/blog/2008/06/googlebots-and-leading-the-blind-dynamic-pages-and-seo/</link>
		<comments>http://www.datamouse.biz/blog/2008/06/googlebots-and-leading-the-blind-dynamic-pages-and-seo/#comments</comments>
		<pubDate>Sun, 15 Jun 2008 14:58:02 +0000</pubDate>
		<dc:creator>DataMouse</dc:creator>
				<category><![CDATA[SEO]]></category>
		<category><![CDATA[Crawl]]></category>
		<category><![CDATA[Dynamic]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[Links]]></category>
		<category><![CDATA[PHP]]></category>

		<guid isPermaLink="false">http://www.datamouse.biz/blog/wordpress/?p=52</guid>
		<description><![CDATA[Dynamic pages are great for web designers - but poor for search engines.
The purpose of this article is to show you how you can index your dynamic pages with the search engines, and some of the pitfalls that you can avoid when submitting these to Google et al.]]></description>
			<content:encoded><![CDATA[<p>Dynamic pages are fantastic. They mean less work for web designers, who can create a single webpage and change its content through a database connection. They’re quicker loading for visitors (after the first load), as the graphical elements are cached by most web browsers and only the content changes. They allow for great flexibility in content, such as shopping sites, blogs and more.</p>
<p>However, before you get excited about dynamic pages and race off to learn PHP, there is a downside to these web pages: search engine bots often have trouble seeing them. And if they can’t be seen by the bots, they can’t be indexed by the engines.</p>
<p>The purpose of this article is to show you how you can index your dynamic pages with the search engines, and some of the pitfalls that you can avoid when submitting these to Google et al.</p>
<p><strong>What is a Dynamic Page?</strong></p>
<p>As mentioned above, a dynamic page is a single webpage that dynamically pulls its content from an external source – usually a web database such as MySQL.</p>
<p>This has many benefits, not least of all for the designer, who only needs to create one webpage design. If he wishes to update his site, he need only change the one page and all pages presented to the end user will be changed too (as they all use the same template and change content dynamically, of course).</p>
<p>So you can use dynamic pages any time that you have information that is categorized by date, or if you run an ecommerce shopping site. Dynamic pages are also compatible with multiple service providers, so there is no need to worry about cross-browser support or about migrating your site to another host (migrating the database is another issue).</p>
<p>You can spot dynamic pages as you surf around the web by looking at their URLs. Dynamic pages contain the character “?” or one of “#&amp;*!%” within their link, such as <a href="http://www.yourdomain.com/?page=10">www.yourdomain.com/?page=10</a></p>
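<p>Spotting such a URL can also be done in code. The sketch below uses Python&#8217;s standard urllib.parse module and treats any URL with a query string (the part after &#8220;?&#8221;) as dynamic. That rule and the helper name describe_url are assumptions for illustration, not a definition used by any search engine.</p>

```python
from urllib.parse import parse_qs, urlsplit


def describe_url(url):
    """Split a URL and report whether it carries a query string."""
    parts = urlsplit(url)
    return {
        "path": parts.path,
        "is_dynamic": bool(parts.query),   # assumption: a query string means dynamic
        "params": parse_qs(parts.query),   # parameter names mapped to lists of values
    }


info = describe_url("http://www.yourdomain.com/?page=10")
static_info = describe_url("http://www.yourdomain.com/about.html")
print(info)
print(static_info)
```

<p>For the example link above, the page parameter comes back with the value 10, while the plain about.html address (a made-up static page) has no parameters at all.</p>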
<p><strong>So What is the Issue?</strong></p>
<p>The issue is the characters used in the URL. These tell the web server which content to fetch from the web database. So, in essence, the content isn’t retrieved unless a request asks for it with the right parameters.</p>
<p>Googlebots, and other bots for that matter, only request the URLs they find in links. They do not make up parameter values or fill in forms, so they cannot retrieve content that nothing links to. Remember: if they cannot see it, they cannot index it.</p>
<p>Additionally, some black hat SEO people and spammers use the same characters to trap search engine spiders. When you consider both of these reasons, it’s not hard to see why the spiders are both unable and unwilling (if you can say such a thing about a program) to index your dynamic pages.</p>
<p><strong>So I’m Stuck Then?</strong></p>
<p>Not at all. As webmasters, we have to be a little more imaginative in dealing with the issue of non-indexing and there are plenty of white hat SEO options available to you.</p>
<blockquote><p><strong>Make regular static pages that link to your dynamic pages</strong><br />
This one is really simple. On areas of your site that are static (and therefore can be indexed), add a link to your dynamic page.</p>
<p>Google and the others will be able to access the page via this link, and index the result. Also, please remember to optimise your anchor text too. You can read how to <a href="http://www.datamouse.biz/blog/wordpress/?p=51">optimise anchor text</a> in another article.</p></blockquote>
<blockquote><p><strong>Make blogs or blog posts that link to your dynamic pages</strong><br />
Similar to the previous suggestion, but involves linking to one dynamic page from another. The difference is that, as well as linking from your blog/post, your blog post would also be submitted to various websites or feeds, such as Digg, DropJack etc.</p>
<p>As these sites are heavily crawled by Google, your blog articles will be indexed as well. Once these are indexed, their “child” links will be picked up too.</p></blockquote>
<blockquote><p><strong>Optimize any dynamic pages that need to be indexed</strong><br />
Just because your page is dynamic, this is no reason to ignore the importance of on-the-page SEO.</p>
<p>Look at your <a href="http://www.datamouse.biz/blog/wordpress/?p=39">metatags</a>, including the title, and optimise wherever possible. Optimise the content itself. Good content attracts both search engines and visitors. Check your keywords and their densities. In short, do everything that you would do if the page were static.</p></blockquote>
<blockquote><p><strong>Articles and content pages</strong><br />
If your dynamic pages are articles, such as a blog post, submit the article to webfeeds and sites, such as Digg.</p>
<p>As mentioned earlier, these sites are heavily crawled and indexed, and, even with a link to a dynamic page, Google will index your page with ease.</p></blockquote>
<blockquote><p><strong>Link to your dynamic pages with a table of contents</strong><br />
Create a single page sitemap of your dynamic pages. Do this for your web visitors – not for the search engines – and optimise the anchor text.</p>
<p>The benefit of this is that the static sitemap page will be indexed quickly, depending on the directory depth of the page, and, likewise, the pages it links to will also be indexed.</p></blockquote>
<blockquote><p><strong>Rewrite Your Page URLs with .htaccess</strong><br />
This is an extremely powerful, but potentially complex, means of dealing with the issue of dynamic page indexing.</p>
<p>Basically, a rewrite lets visitors and bots use a URL such as <em>yourdomain.com/dir/page10/</em> in place of <em>yourdomain.com/?page=10</em>. This works with Apache servers, and is well worth considering, if not implementing.</p>
<p>This rewrite rule serves a request for <em>yourdomain/posts/page1</em> from <em>yourdomain/posts.php?page=1</em>:</p>
<p><em>Options +FollowSymLinks<br />
RewriteEngine on<br />
RewriteRule ^posts/page([0-9]+)/?$ /posts.php?page=$1 [NC,L]</em></p>
<p>Of course, this assumes that you are using PHP as your dynamic page language.</p>
<p>If you are interested in this option, the Apache documentation covers <em>mod_rewrite</em> in much more detail.</p></blockquote>
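<p>The mapping that such a rewrite performs can be tried out away from the server. The Python sketch below mimics a rule of the form <em>^posts/page([0-9]+)/?$</em> pointing at <em>/posts.php?page=$1</em>; that pattern is an assumption chosen to fit the posts/page1 description above, and the code only imitates the matching step rather than Apache itself.</p>

```python
import re

# Assumed pattern, mirroring a rule such as
#   RewriteRule ^posts/page([0-9]+)/?$ /posts.php?page=$1 [NC,L]
# In .htaccess the leading slash is not part of the matched path.
PRETTY_URL = re.compile(r"^posts/page([0-9]+)/?$", re.IGNORECASE)


def rewrite(path):
    """Map a friendly path to the dynamic script URL, or None if it does not match."""
    match = PRETTY_URL.match(path)
    if match is None:
        return None
    return "/posts.php?page=" + match.group(1)


rewritten = rewrite("posts/page1")
unmatched = rewrite("about.html")
print(rewritten, unmatched)
```

<p>Here posts/page1 maps to /posts.php?page=1, while a path that does not fit the pattern is left alone, which is how the friendly address and the dynamic page stay connected.</p>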
<p>So if you really need dynamic pages, remember to set them up so that the Googlebots can see and record all the information on your site. As illustrated this is not an impossible task. It is just a question of working within the Google rules to help those bots to read all the information in your site.</p>
<p>DM</p>
]]></content:encoded>
			<wfw:commentRss>http://www.datamouse.biz/blog/2008/06/googlebots-and-leading-the-blind-dynamic-pages-and-seo/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
