WordPress robots.txt tips against duplicate content
Posted by Dan | Posted in Shoemoney | Posted on 03-03-2008
I've been getting some questions about my robots.txt file and what certain lines in it do.
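Thankfully the major crawlers support a couple of regex-style wildcards in robots.txt (but not full regular expressions).
* matches any sequence of characters, and $ marks the end of the URL. So if you put .php$ in your robots.txt, it will match any URL that ends in .php
This is really handy when you want to block all .exe, .php or other files by extension. Matching is case-sensitive, so write the extension the way it appears in your URLs. For example: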
Disallow: /*.PDF$
Disallow: /*.jpeg$
Disallow: /*.exe$
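To see how these patterns behave, here is a minimal Python sketch that translates a robots.txt pattern into a regular expression and checks a few paths against it. It assumes the wildcard behavior described above (* matches any run of characters, a trailing $ anchors the end of the URL). The helper names robots_pattern_to_regex and is_blocked, and the sample paths, are made up for illustration; this is not how any particular crawler is implemented.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the match at the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, url_path):
    """Return True if url_path matches the Disallow pattern."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the patterns above.
    print(is_blocked("/*.exe$", "/downloads/setup.exe"))      # True
    print(is_blocked("/*.jpeg$", "/images/photo.jpeg?w=200")) # False: does not end in .jpeg
    print(is_blocked("/*.PDF$", "/docs/guide.pdf"))           # False: matching is case-sensitive
    print(is_blocked("/*.php$", "/index.php"))                # True
```

Without the trailing $, a pattern only has to match the start of the URL, so /*.jpeg would also catch /images/photo.jpeg?w=200.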
Specifically, these are some of the things I use in my robots.txt:
Disallow: /*? – this blocks all URLs with a ? in them. A good way to avoid duplicate content issues with WordPress blogs. Obviously you only want to use this if you have changed your permalink structure away from the default ?p= query-string URLs, otherwise you will block your own posts.
Disallow: /*.php$ – this blocks all URLs ending in .php. Another good way to avoid duplicate content with a WordPress blog.
Disallow: /*.inc$ – you should not be showing .inc or include files to bots (Google Code Search will eat you alive).
Disallow: /*.css$ – there is no reason to offer CSS files up for indexing. The wildcard is used here in case there are many CSS files.
Disallow: */feed/ – feeds being indexed dilute your site equity. The wildcard * is used in case there are preceding characters.
Disallow: */trackback/ – no reason a trackback URL should be indexed. The wildcard * is used in case there are preceding characters.
Disallow: /page/ – assloads of duplicate content in WordPress's paginated archive pages.
Disallow: /tag/ – more duplicate content.
Disallow: /category/ – even more duplicate content.
So what if you want to ALLOW a page? For instance, my SERPs tool lives at serps.php, and the /*.php$ rule above would block it.
Allow: /serps.php – this does the trick!
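Why the Allow line wins can be sketched in Python. This assumes the precedence rule that Google documents and RFC 9309 describes: the longest matching pattern wins, and on a tie Allow beats Disallow. Older crawlers may instead apply the first matching line or ignore Allow entirely, so treat this as an illustration of one rule set, not a guarantee for every bot. The RULES list is a subset of the directives above, and can_fetch and the sample paths are hypothetical names for the example.

```python
import re

# A subset of the directives from this post, as (kind, pattern) pairs.
RULES = [
    ("disallow", "/*?"),
    ("disallow", "/*.php$"),
    ("disallow", "*/feed/"),
    ("disallow", "/page/"),
    ("allow", "/serps.php"),
]


def _to_regex(pattern):
    """Same translation as before: '*' is a wildcard, trailing '$' anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def can_fetch(url_path, rules=RULES):
    """Assumed rule: longest matching pattern wins; Allow wins a tie."""
    matches = [
        (len(pattern), kind == "allow")
        for kind, pattern in rules
        if _to_regex(pattern).match(url_path)
    ]
    if not matches:
        return True  # nothing matched, so crawling is allowed
    # max() compares length first, then prefers True (allow) on equal length.
    return max(matches)[1]


if __name__ == "__main__":
    print(can_fetch("/serps.php"))                # True: the 10-character Allow beats /*.php$
    print(can_fetch("/index.php"))                # False: only /*.php$ matches
    print(can_fetch("/2008/03/some-post/feed/"))  # False: */feed/ matches
    print(can_fetch("/2008/03/some-post/"))       # True: no rule matches
```

Under this rule the order of the lines in the file does not matter; for crawlers that read top to bottom and stop at the first match, putting the Allow line above the Disallow lines is the safer layout.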
Keep in mind I am not an SEO, but I have picked up a few tricks along the way.


