
Yahoo and Google both announce wildcard support in robots.txt files

Below is the official announcement published on the Yahoo! Search Blog, together with its robots.txt wildcard solution:
Yahoo! Search Crawler (Yahoo! Slurp) - Supporting wildcards in robots.txt

I was going through my notes from Danny Sullivan’s Open Feedback sessions that occur during the “Meet the Crawlers” panel at Search Engine Strategies. One of the items on my list was a request for enhanced syntax in robots.txt to make it easier for webmasters to manage how search crawlers, including Slurp, access your content.

For those who may not be as familiar with search index terminology, webmasters use the robots.txt file to direct robots that visit their site, including search engine crawlers, which files should be crawled and which shouldn’t be. You can read about our support for robots directives in the help for Yahoo! Slurp.

Well, we can scratch that one off the list, since we have just updated Yahoo! Slurp to recognize two additional symbols in the robots.txt directives: ‘*’ and ‘$’. The semantics of these symbols are what is widely understood for robots.txt files.

‘*’ - matches a sequence of characters

You can now use ‘*’ in robots directives for Yahoo! Slurp to wildcard match a sequence of characters in your URL. You can use this symbol in any part of the URL string you provide in the robots directive. For example,

User-Agent: Yahoo! Slurp
Allow: /public*/  # allow every directory whose name begins with public to be crawled
Disallow: /*_print*.html
Disallow: /*?sessionid  # block every page whose URL contains the sessionid parameter


The robots directives above will:

allow all directories that begin with ‘public’, such as ‘/public_html/’ or ‘/public_graphs/’ to be crawled
disallow any files or directories which contain ‘_print’, such as ‘/card_print.html’ or ‘/store_print/product.html’ to be crawled
disallow any files with ‘?sessionid’ in their URL string, such as ‘/cart.php?sessionid=342bca31’ to be crawled

Note that a trailing ‘*’ is redundant since that is existing matching behavior for Slurp. So, the following two directives are equivalent:

User-Agent: Yahoo! Slurp
Disallow: /private*
Disallow: /private
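
The ‘*’ matching described above can be sketched in a few lines of Python. This is only an illustration, not Slurp's actual implementation: the helper names rule_to_regex and matches are hypothetical, and the sketch assumes a rule is compared against the URL path plus query string as a prefix, with ‘*’ standing for any run of characters.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex (hypothetical helper).

    Assumption: '*' matches any sequence of characters and every other
    character is matched literally.
    """
    literal_parts = [re.escape(part) for part in rule.split("*")]
    return re.compile(".*".join(literal_parts))

def matches(rule: str, url_path: str) -> bool:
    # re.match anchors only at the start, which gives prefix semantics.
    return rule_to_regex(rule).match(url_path) is not None

print(matches("/public*/", "/public_html/index.html"))
print(matches("/*_print*.html", "/store_print/product.html"))
print(matches("/*?sessionid", "/cart.php?sessionid=342bca31"))

# A trailing '*' adds nothing under prefix matching.
for path in ("/private", "/private/a.html", "/privateer", "/public/"):
    assert matches("/private*", path) == matches("/private", path)
```

Because the rule is already treated as a prefix, ‘/private*’ and ‘/private’ accept exactly the same paths in this sketch, which is the redundancy the post points out.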

‘$’ - anchors at the end of the URL string

You can now also use ‘$’ in robots directives for Slurp to anchor the match to the end of the URL string. Without this symbol, Yahoo! Slurp would match all URLs against the directives, treating the directives as a prefix. For example:

User-Agent: Yahoo! Slurp
Disallow: /*.gif$
Allow: /*?$

The robots directives above will:

Disallow all files ending in ‘.gif’ in your entire site. Note that without the ‘$’, this would disallow all files containing ‘.gif’ in their file path
Allow all files ending in ‘?’ to be included. This would not automatically allow files that just contain ‘?’ somewhere in the URL string

As you can see, this symbol only makes sense at the end of the string. Hence, when we see it, we assume that your directive terminates there and any characters after that symbol are ignored.
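
To make the effect of ‘$’ concrete, here is a small Python sketch under the same assumptions as the text: ‘*’ matches any run of characters, a rule without ‘$’ is a prefix match, and anything written after ‘$’ is ignored. The function names are hypothetical and this is not Yahoo's code.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Hypothetical helper: robots.txt pattern with '*' and '$' to regex."""
    anchored = "$" in rule
    if anchored:
        # Per the announcement, characters after '$' are ignored.
        rule = rule.split("$", 1)[0]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    # \Z forces the match to consume the whole URL string.
    return re.compile(body + (r"\Z" if anchored else ""))

def matches(rule: str, url_path: str) -> bool:
    return rule_to_regex(rule).match(url_path) is not None

print(matches("/*.gif$", "/images/logo.gif"))         # True: ends in .gif
print(matches("/*.gif$", "/images/logo.gif?size=2"))  # False: .gif is not at the end
print(matches("/*.gif", "/images/logo.gif?size=2"))   # True: prefix match without '$'
print(matches("/*?$", "/page?"))                      # True: ends in '?'
print(matches("/*?$", "/page?x=1"))                   # False: '?' is not the last character
```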

Oh, by the way, if you thought we didn’t support the ‘Allow’ tag, as you can see from these examples, we do.

If you have any questions about the new syntax or any particular cases you are concerned about, please write in at the Site Explorer forums or read up our area.

Next time you see me at SES, you should ask me what else is on my list!

Priyank Garg
Product Manager, Yahoo! Search
Google's robots.txt wildcard solution:
Google's URL removal page contains a little bit of handy information that's not found on their webmaster info pages where it should be.
Google supports the use of 'wildcards' in robots.txt files. This isn't part of the original 1994 robots.txt protocol, and as far as I know, is not supported by other search engines. To make it work, you need to add a separate section for Googlebot in your robots.txt file. An example:

User-agent: Googlebot
Disallow: /*sort=  # Googlebot will not crawl any URL that contains sort=


This would stop Googlebot from reading any URL that included the string sort= no matter where that string occurs in the URL.



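Blocking user agents

The Disallow line lists the pages you want to block. You can list a specific URL or a pattern. The entry should begin with a forward slash (/).

To block the entire site, use a forward slash.
Disallow: /
To block a directory and everything in it, follow the directory name with a forward slash.
Disallow: /无用目录/
To block a page, list the page.
Disallow: /私人文件.html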
To remove a specific image from Google Image Search, add the following:
User-agent: Googlebot-Image
Disallow: /图片/狗.jpg
To remove all images on your site from Google Image Search, use the following:
User-agent: Googlebot-Image
Disallow: /
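To block files of a specific file type (for example, .gif), use the following:
User-agent: Googlebot
Disallow: /*.gif$
To prevent pages on your site from being crawled while still showing AdSense ads on those pages, disallow all robots other than Mediapartners-Google. This keeps the pages out of search results while still letting the Mediapartners-Google robot analyze them to determine which ads to show. The Mediapartners-Google robot does not share pages with the other Google user agents. For example:
User-agent: *
Disallow: /文件夹 1/

User-agent: Mediapartners-Google
Allow: /文件夹 1/
Note that directives are case-sensitive. For example, Disallow: /无用文件.asp blocks http://www.cnnas.com/无用文件.asp but still allows http://www.cnnas.com/无用文件.ASP.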
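
The per-robot sections and the case-sensitivity note above can be checked with Python's standard urllib.robotparser module. This is a rough sketch: the host www.example.com and the ASCII paths /folder1/ and /junk_file.asp are placeholders chosen for the example, and the rules are deliberately plain prefixes, since this sketch does not rely on the standard parser handling ‘*’ or ‘$’ inside paths.

```python
from urllib.robotparser import RobotFileParser

# Placeholder rules modelled on the examples above (paths are hypothetical).
ROBOTS_TXT = """\
User-agent: *
Disallow: /folder1/
Disallow: /junk_file.asp

User-agent: Mediapartners-Google
Allow: /folder1/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "http://www.example.com"

# Ordinary crawlers fall under the '*' section and are kept out of /folder1/.
print(parser.can_fetch("Googlebot", base + "/folder1/page.html"))             # False
# The AdSense crawler has its own section, which allows the same folder.
print(parser.can_fetch("Mediapartners-Google", base + "/folder1/page.html"))  # True
# Paths are compared case-sensitively.
print(parser.can_fetch("Googlebot", base + "/junk_file.asp"))                 # False
print(parser.can_fetch("Googlebot", base + "/Junk_file.asp"))                 # True
```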

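Pattern matching

Googlebot (but not every search engine) follows certain pattern-matching rules.

To match a sequence of characters, use an asterisk (*). For example, to block access to all subdirectories that begin with private, use the following:
User-agent: Googlebot
Disallow: /private*/
To block access to all URLs that contain a question mark (?) (more precisely, any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string), use the following:
User-agent: Googlebot
Disallow: /*?
To match the end of a URL, use $. For example, to block all URLs that end with .xls, use the following:
User-agent: Googlebot
Disallow: /*.xls$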
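You can combine this pattern matching with the Allow directive. For example, if a ? indicates a session ID, you may want to exclude all URLs that contain a ? so that Googlebot does not crawl duplicate pages. However, a URL that ends with a ? may be the version of the page that you do want included. In that case, you can set up your robots.txt file as follows:

User-agent: *
Allow: /*?$
Disallow: /*?

The Disallow: /*? directive blocks any URL that contains a ? (more precisely, it blocks any URL that begins with your domain name, followed by any string, followed by a question mark, followed by any string).

The Allow: /*?$ directive allows any URL that ends with a ? (more precisely, it allows any URL that begins with your domain name, followed by any string, followed by a ?, with no characters after the ?).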
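
The text above does not say how a crawler chooses when both an Allow and a Disallow rule match the same URL. The sketch below assumes the longest matching pattern wins and that Allow wins a tie, which is the precedence later written down in RFC 9309; treat that as an assumption rather than something stated here. All function names are hypothetical.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    """Hypothetical helper: '*' matches any run, a trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

# The two rules from the example above.
RULES = [("allow", "/*?$"), ("disallow", "/*?")]

def is_allowed(url_path: str, rules=RULES) -> bool:
    # Collect (pattern length, is_allow) for every rule that matches.
    hits = [
        (len(pattern), kind == "allow")
        for kind, pattern in rules
        if rule_to_regex(pattern).match(url_path)
    ]
    if not hits:
        return True  # no rule matches, so crawling is permitted
    # Assumed precedence: longest pattern wins; on equal length,
    # True (allow) sorts above False (disallow).
    return max(hits)[1]

print(is_allowed("/page?"))       # True: ends in '?', so the Allow rule applies
print(is_allowed("/page?id=3"))   # False: contains '?' but does not end with it
print(is_allowed("/page"))        # True: no '?' at all
```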