Google's new Sitemap protocol


niko
16th June 2005, 09:47 AM
This is very important for webmasters: it provides an additional means of making sure your web site gets crawled.

Overview

The Sitemap Protocol allows you to inform search engine crawlers about URLs on your Web sites that are available for crawling. A Sitemap consists of a list of URLs and may also contain additional information about those URLs, such as when they were last modified, how frequently they change, etc.

Sitemaps are particularly beneficial when users can not reach all areas of a Web site through a browseable interface — i.e. users are unable to reach certain pages or regions of a site by following links. For example, any site where certain pages are only accessible via a search form would benefit from creating a Sitemap and submitting it to search engines.

This document describes the formats for Sitemap files and also explains where you should post your Sitemap files so that search engines can retrieve them.

Please note that the Sitemap Protocol supplements, but does not replace, the crawl-based mechanisms that search engines already use to discover URLs. By submitting a Sitemap (or Sitemaps) to a search engine, you will help that engine's crawlers to do a better job of crawling your site.

Using this protocol does not guarantee that your Web pages will be included in search indexes. In addition, using this protocol may not influence the way your pages are ranked by a search engine.

XML Sitemap Format

The XML Sitemap Format allows you to provide a list of URLs and include additional information about those URLs in your Sitemap. This additional information includes the date the content at that URL last changed, how often that content can be expected to change and how important that URL is relative to other URLs on your site.

The XML Sitemap Format uses the following XML tags:

* changefreq — how frequently the content at the URL is likely to change
* lastmod — the time the content at the URL was last modified
* loc — the URL location
* priority — the priority of the page relative to other pages on the same site
* url — this tag encapsulates the first four tags in this list
* urlset — this tag encapsulates the first five tags in this list

Note: All data values, including URLs, in your Sitemap files must be XML-encoded. The chart below provides a list of characters with their corresponding encoded values. You can use either the entity or the character code to XML encode a character. Please see the FAQ for more information about XML encoding.


Escaped Forms
                Character   Entity    Character Code
Ampersand       &           &amp;     &#38;
Single Quote    '           &apos;    &#39;
Double Quote    "           &quot;    &#34;
Greater Than    >           &gt;      &#62;
Less Than       <           &lt;      &#60;


Sample XML Sitemap

The following example shows a Sitemap in XML format. The Sitemap in the example contains a small number of URLs, each of which is identified using the loc XML tag. In this example, a different set of optional parameters has been provided for each URL.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
<url>
<loc>http://www.yoursite.com/</loc>
<lastmod>2005-01-01</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>http://www.yoursite.com/catalog?item=12&amp;desc=vacation_hawaii</loc>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.yoursite.com/catalog?item=73&amp;desc=vacation_new_zealand</loc>
<lastmod>2004-12-23</lastmod>
<changefreq>weekly</changefreq>
</url>
<url>
<loc>http://www.yoursite.com/catalog?item=74&amp;desc=vacation_newfoundland</loc>
<lastmod>2004-12-23T18:00:15+00:00</lastmod>
<priority>0.3</priority>
</url>
<url>
<loc>http://www.yoursite.com/catalog?item=83&amp;desc=vacation_usa</loc>
<lastmod>2004-11-23</lastmod>
</url>
</urlset>

You can compress your Sitemap files using gzip. Compressing your Sitemap files will reduce your bandwidth requirement. Please note that your uncompressed Sitemap file may not be larger than 10MB.

Note: Your Sitemap files must use UTF-8 encoding.
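
The sample and the two notes above can be sketched in code. This is only a minimal illustration, assuming Python 3 and its standard library; the function names build_sitemap and compress_sitemap and the dict layout of each entry are hypothetical and are not part of the protocol.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace copied from the sample above (protocol version 0.84).
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"
MAX_UNCOMPRESSED_BYTES = 10 * 1024 * 1024  # the 10MB limit on the uncompressed file


def build_sitemap(entries):
    """Return UTF-8 encoded Sitemap XML (bytes) for a list of entry dicts.

    Each entry needs a "loc" key; "lastmod", "changefreq" and "priority"
    are optional.
    """
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                # ElementTree XML-encodes the text (& becomes &amp;) on output.
                ET.SubElement(url, tag).text = str(entry[tag])
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(urlset, encoding="utf-8")


def compress_sitemap(xml_bytes):
    """Gzip a Sitemap after checking the uncompressed size limit."""
    if len(xml_bytes) > MAX_UNCOMPRESSED_BYTES:
        raise ValueError("uncompressed Sitemap is larger than 10MB")
    return gzip.compress(xml_bytes)
```

Pass the URLs in their raw form (with a plain &), because the serializer does the XML encoding; passing already-encoded text would encode it twice.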

XML Tag Definitions

This section provides details about the XML tags that can appear in your Sitemap(s). In the "Subtags" section of some of the XML tag definitions, a question mark ("?") appearing after the name of an XML tag indicates that the tag is optional.

-----------------------------------------------

changefreq

Definition


Optional. This value indicates how frequently the content at a particular URL is likely to change. The value must be either "always", "hourly", "daily", "weekly", "monthly", "yearly" or "never". The value "always" should be used to describe documents that change each time they are accessed. The value "never" should be used to describe archived URLs.

Please note that the value of this tag is considered a hint and not a command. Even though search engine crawlers consider this information when making decisions, they may crawl pages marked "hourly" less frequently than that, and they may crawl pages marked "yearly" more frequently than that. It is also likely that crawlers will periodically crawl pages marked "never" so that they can handle unexpected changes to those pages.

Constraints


Enumerated list. Valid values are "always", "hourly", "daily", "weekly", "monthly", "yearly" and "never".

Example
<changefreq>monthly</changefreq>

Subtag of
url

Content Format


Text


---------------------------------------------

lastmod

Definition


Optional. The time the URL was last modified. You should specify the timestamp using ISO 8601; for example, 2004-09-22T14:12:14+00:00. You can omit the time portion of the ISO 8601 format; for example, 2004-09-22 is also valid. This information allows crawlers to avoid recrawling documents that haven't changed.

Constraints


Value must be in ISO 8601 format.

Example
<lastmod>2005-02-21</lastmod>
or
<lastmod>2005-02-21T18:00:15+00:00</lastmod>

Subtag of
url

Content Format


Text

-----------------------------------------------

loc

Definition


Required. A URL for a page on your site.

Constraints


Value must be <= 2048 characters.

Example
<loc>http://www.yoursite.com/catalog?item=1&amp;desc=vacation_hawaii</loc>

Subtag of
url

Content Format


Text

-----------------------------------------------

priority

Definition


Optional. The priority of a particular URL relative to other pages on the same site. The value for this tag is a number between 0.0 and 1.0, where 0.0 identifies the lowest priority page(s) on your site and 1.0 identifies the highest priority page(s) on your site.

The default priority of a page is 0.5.

Please note that the priority you assign to a page has no influence on the position of your URLs in a search engine's result pages. Search engines use this information when selecting between URLs on the same site, so you can use this tag to increase the likelihood that your more important pages are present in a search index.

Also, please note that assigning a high priority to all of the URLs on your site will not help you. Since the priority is relative, it is only used to select between URLs on your site; the priority of your pages will not be compared to the priority of pages on other sites.

Constraints


Value must be between 0.0 and 1.0 inclusive.

Example
<priority>0.7</priority>

Subtag of
url

Content Format


Text


-----------------------------------------------

url

Definition


Encapsulates information about a particular URL.

Subtags
changefreq?, lastmod?, loc, priority?

Subtag of
urlset

Content Format


Empty


-----------------------------------------------

urlset

Definition


Encapsulates information about all of the URLs in a Sitemap file.

Subtags
url

Content Format


Empty
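
The constraints listed in the tag definitions above could be checked before a Sitemap is written. The sketch below assumes Python 3; check_url_entry and the dict layout are hypothetical names used for illustration, and datetime.fromisoformat is used only as an approximation of an ISO 8601 check (it accepts the two forms shown in the lastmod examples).

```python
from datetime import datetime

VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}
MAX_LOC_LENGTH = 2048


def check_url_entry(entry):
    """Return a list of problems found in one url entry (empty if none)."""
    problems = []

    loc = entry.get("loc")
    if not loc:
        problems.append("loc is required")
    elif len(loc) > MAX_LOC_LENGTH:
        problems.append("loc is longer than 2048 characters")

    changefreq = entry.get("changefreq")
    if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
        problems.append("changefreq is not one of the allowed values")

    lastmod = entry.get("lastmod")
    if lastmod is not None:
        try:
            datetime.fromisoformat(lastmod)
        except ValueError:
            problems.append("lastmod is not in a recognised ISO 8601 form")

    priority = entry.get("priority")
    if priority is not None and not 0.0 <= float(priority) <= 1.0:
        problems.append("priority must be between 0.0 and 1.0 inclusive")

    return problems
```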


-----------------------------------------------




Providing Multiple Sitemap Files

You can provide multiple Sitemap files, but each file that you provide must have no more than 50,000 URLs and must be no larger than 10MB (10,485,760) when uncompressed. These limits help to ensure that your Web server does not get bogged down serving very large files.

If you want to list more than 50,000 URLs, you must create multiple Sitemap files. If you anticipate your Sitemap growing beyond 50,000 URLs or 10MB, you should consider creating multiple Sitemap files. If you do provide multiple Sitemaps, you must list them in a Sitemap index file. Sitemap index files may not list more than 1,000 Sitemaps. Your Sitemap index file could be named Sitemap_index.xml.
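
One way to stay under the URL limit is to split the full URL list into chunks before writing the files. This is a sketch assuming Python 3; split_into_sitemaps is a hypothetical helper, and it only enforces the URL counts, not the 10MB size limit, which would need a separate check on each generated file.

```python
MAX_URLS_PER_SITEMAP = 50000
MAX_SITEMAPS_PER_INDEX = 1000


def split_into_sitemaps(urls, chunk_size=MAX_URLS_PER_SITEMAP):
    """Split a list of URLs into lists of at most chunk_size URLs each."""
    chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]
    if len(chunks) > MAX_SITEMAPS_PER_INDEX:
        raise ValueError("more Sitemaps than one Sitemap index file may list")
    return chunks
```

Each chunk would then become one Sitemap file, and every file would be listed in the Sitemap index.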

The XML format of a Sitemap index file is very similar to the XML format of a Sitemap file. The Sitemap index file uses the following XML tags:

* lastmod
* loc
* sitemap
* sitemapindex

Note: A Sitemap index file can only specify Sitemaps that are found on the same site as the Sitemap index file. For example, http://www.yoursite.com/sitemap_index.xml can include Sitemaps on http://www.yoursite.com but not on http://www.mysite.com or http://yourhost.yoursite.com.

Sample XML Sitemap Index

The following example shows a Sitemap index in XML format. The Sitemap index lists two Sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.84">
<sitemap>
<loc>http://www.mysite.com/sitemap1.xml.gz</loc>
<lastmod>2004-10-01T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>http://www.mysite.com/sitemap2.xml.gz</loc>
<lastmod>2005-01-01</lastmod>
</sitemap>
</sitemapindex>

Note: Sitemap URLs, like all values in your XML files, must be XML-encoded.
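
A Sitemap index like the sample above could be generated along these lines. The sketch assumes Python 3; build_sitemap_index and its list-of-tuples input are hypothetical, and the XML encoding of the loc values is left to the serializer.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the sample index above (protocol version 0.84).
SITEMAP_NS = "http://www.google.com/schemas/sitemap/0.84"


def build_sitemap_index(sitemaps):
    """Return UTF-8 encoded Sitemap index XML (bytes).

    sitemaps is a list of (loc, lastmod) tuples; lastmod may be None
    because the tag is optional.
    """
    index = ET.Element("sitemapindex", {"xmlns": SITEMAP_NS})
    for loc, lastmod in sitemaps:
        node = ET.SubElement(index, "sitemap")
        ET.SubElement(node, "loc").text = loc
        if lastmod is not None:
            ET.SubElement(node, "lastmod").text = lastmod
    declaration = b'<?xml version="1.0" encoding="UTF-8"?>\n'
    return declaration + ET.tostring(index, encoding="utf-8")
```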

Sitemap Index XML Tag Definitions

*

The loc tag is required and identifies the location of the Sitemap.
*

The lastmod tag is an optional tag that identifies the time that the corresponding Sitemap file was modified. It does not correspond to the time that any of the pages listed in that Sitemap were changed. The value for the lastmod tag should be in ISO 8601 format.

By providing the last modification timestamp, you enable search engine crawlers to retrieve only a subset of the Sitemaps in the index — i.e. a crawler could only retrieve Sitemaps that were modified since a certain date. This incremental Sitemap fetching mechanism allows for the rapid discovery of new URLs on very large sites.
*

The sitemap tag encapsulates information about an individual Sitemap.
*

The sitemapindex tag encapsulates information about all of the Sitemaps in the file.
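
The incremental fetching idea described for lastmod can be illustrated from the crawler's side. This is only a guess at how a consumer might use the index, assuming Python 3; sitemaps_modified_since and parse_lastmod are hypothetical names, and treating a date-only lastmod as midnight UTC is an assumption of this sketch, not something the protocol text states.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"sm": "http://www.google.com/schemas/sitemap/0.84"}


def parse_lastmod(value):
    """Parse a lastmod value; date-only values are read as midnight UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def sitemaps_modified_since(index_xml, cutoff):
    """Return the loc of every Sitemap whose lastmod is at or after cutoff."""
    root = ET.fromstring(index_xml)
    selected = []
    for node in root.findall("sm:sitemap", NS):
        loc = node.findtext("sm:loc", namespaces=NS)
        lastmod = node.findtext("sm:lastmod", namespaces=NS)
        # Without a lastmod there is nothing to compare, so keep the Sitemap.
        if lastmod is None or parse_lastmod(lastmod.strip()) >= cutoff:
            selected.append(loc.strip())
    return selected
```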




Location of Sitemap Files

The location of a Sitemap file determines the set of URLs that can be included in that Sitemap. A Sitemap file located at http://yoursite.com/catalog/sitemap.gz can include any URLs starting with http://yoursite.com/catalog/ but can not include URLs starting with http://yoursite.com/images/.

If you have the permission to change "http://site.org/path/sitemap.gz", it is safe to assume that you also have permission to provide information for URLs with the prefix "http://site.org/path/". Examples of URLs considered valid in http://yoursite.com/catalog/sitemap.gz include:

http://yoursite.com/catalog/show?item=23
http://yoursite.com/catalog/show?item=233&user=3453

URLs not considered valid in http://yoursite.com/catalog/sitemap.gz include:

http://yoursite.com/image/show?item=23
http://yoursite.com/image/show?item=233&user=3453
http://mysite.com/catalog/show?item=24

URLs that are not considered valid are dropped from further consideration. It is strongly recommended that you place your Sitemap at the root directory of your web server. For example, if your HTTP Web server is at yoursite.com, then your Sitemap index file would be at "http://yoursite.com/sitemap.gz". In certain cases, you may need to produce different Sitemaps for different paths — e.g. if security permissions in your organization compartmentalize write access to different directories.
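
The location rule above amounts to a prefix check. The sketch below assumes Python 3 and assumes the Sitemap URL ends in a file name; allowed_in_sitemap is a hypothetical helper that may not match how any particular search engine applies the rule.

```python
def allowed_in_sitemap(sitemap_url, page_url):
    """Return True if page_url starts with the directory of sitemap_url."""
    # Drop the file name to get the directory prefix, keeping the slash.
    prefix = sitemap_url.rsplit("/", 1)[0] + "/"
    return page_url.startswith(prefix)
```

With the Sitemap at http://yoursite.com/catalog/sitemap.gz, this accepts http://yoursite.com/catalog/show?item=23 and rejects both http://yoursite.com/image/show?item=23 and http://mysite.com/catalog/show?item=24.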



Frequently Asked Questions
How do I XML-encode a URL?

Does it matter which character encoding method I use to generate my Sitemap files?

How do I specify time?

How do I compute lastmod date?

Where do I place my Sitemap?

How big can my Sitemap be?

My site has tens of millions of URLs; can I somehow submit only those that have changed recently?

What happens after I produce my Sitemap?

Do URLs in the Sitemap need to be completely specified?

My site has both "http" and "https" version of URLs. Do I need to list both?

URLs on my site have session IDs in them. Do I need to remove them?

Does position of a URL in a Sitemap influence its use?

Some of the pages on our site use frames. Should we include the frameset URLs or the URLs of the frame contents?

Can I zip my Sitemaps or do they have to be gzipped?

Will the "priority" hint in the XML Sitemap change the ranking of my pages in search results?

Is there an XML schema that I can validate my XML Sitemap against?


Q: How do I XML-encode a URL?

To properly encode your URLs, follow the procedure recommended by the HTML 4.0 specification, section B.2.1. Convert the string to UTF-8 and then URL-escape the result. For details about Internationalized Resource Identifiers, also see RFC2396 (sections 2.3 and 2.4) and RFC3987.

The following is an example python script for XML encoding a URL:

$ python
Python 2.2.2 (#1, Feb 24 2003, 19:13:11)
>>> import xml.sax.saxutils
>>> xml.sax.saxutils.escape("http://www.test.org/view?widget=3&count>2")

The encoded URL from the example above is:

http://www.test.org/view?widget=3&amp;count&gt;2
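
The same call still works as a script on current Python 3. In this sketch, xml_encode is a hypothetical wrapper name; escape() handles the ampersand and the angle brackets by default, and the extra mapping adds the two quote characters.

```python
from xml.sax.saxutils import escape

# Entities for the quote characters, which escape() leaves alone by default.
QUOTE_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def xml_encode(value):
    """XML-encode a data value such as a URL for use in a Sitemap."""
    return escape(value, QUOTE_ENTITIES)


print(xml_encode("http://www.test.org/view?widget=3&count>2"))
```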

Q: Does it matter which character encoding method I use to generate my Sitemap files?

Yes. Your Sitemap files must use UTF-8 encoding.

Q: How do I specify time?

Use ISO 8601 encoding for the lastmod timestamps and all other dates and times in this protocol. For example, 2004-09-22T14:12:14+00:00.

If you wish, you can omit the time portion of the ISO8601 format; for example, 2004-09-22 is also valid. However, if your site changes frequently, you are encouraged to include the time portion so crawlers have more complete information about your site.

Q: How do I compute lastmod date?

For static files, this is the actual file update date. You can use the UNIX date command to get this date:
$ date --iso-8601=seconds -u -r /home/foo/www/bar.html
>> 2004-10-26T08:56:39+00:00

For many dynamic URLs, you may be able to easily compute a lastmod date based on when the underlying data was changed or by using some approximation based on periodic updates (if applicable). Using even an approximate date or timestamp can help crawlers avoid crawling URLs that have not changed. This will reduce the bandwidth and CPU requirements for your Web servers.
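
For static files, the same timestamp can be read from a script instead of the date command. This sketch assumes Python 3; file_lastmod is a hypothetical helper, and the path in the usage note is just the one from the example above.

```python
import os
from datetime import datetime, timezone


def file_lastmod(path):
    """Return the file's modification time as an ISO 8601 UTC timestamp."""
    mtime = os.path.getmtime(path)
    moment = datetime.fromtimestamp(mtime, tz=timezone.utc)
    # Drop sub-second precision so the value matches the lastmod examples.
    return moment.replace(microsecond=0).isoformat()
```

Calling file_lastmod("/home/foo/www/bar.html") would return a string in the same 2004-10-26T08:56:39+00:00 style as the date output.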

Q: Where do I place my Sitemap?

It is strongly recommended that you place your Sitemap at the root directory of your HTTP server; that is, place it at http://yoursite.com/sitemap.gz.

In some situations, you may want to produce different Sitemaps for different paths on your site — e.g. if security permissions in your organization compartmentalize write access to different directories.

If you have the permission to change http://site.org/path/sitemap, then it is generally safe to assume that you also have permission to report metadata under http://site.org/path/.

Q: How big can my Sitemap be?

Search engines will not process Sitemaps larger than 10MB (10,485,760 bytes) in length when uncompressed or that contain more than 50,000 URLs. This means that if your site contains more than 50,000 URLs or your Sitemap is bigger than 10MB, you must create multiple Sitemap files and use a Sitemap index file. You should use a Sitemap index file even if you have a small site but plan on growing beyond 50,000 URLs or a filesize of 10MB.

Q: My site has tens of millions of URLs; can I somehow submit only those that have changed recently?

You can list the updated URLs in a small number of Sitemaps that change frequently and then use the lastmod tag in your Sitemap index file to identify those Sitemap files. Search engines will then incrementally crawl only the changed Sitemaps.

Q: What happens after I produce my Sitemap?

After you produce your Sitemap, you will need to notify search engines of the Sitemap's location. The search engines that you notify will then retrieve your Sitemap and make the URLs available to their crawlers.

Q: Do URLs in the Sitemap need to be completely specified?

Yes. Search engines will crawl the URLs exactly as you provide them. (Search engines will XML decode your URLs if they are XML-encoded.) You do need to include the protocol — e.g. http — in your URL; you also need to include a trailing slash in your URL if your Web server requires one. For example, http://www.google.com/ is a valid URL for a Sitemap, whereas www.google.com is not.
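
A quick way to catch URLs that are missing the protocol is to parse them before adding them to the Sitemap. The sketch assumes Python 3; is_fully_specified is a hypothetical helper, and it does not check whether your Web server needs a trailing slash.

```python
from urllib.parse import urlsplit


def is_fully_specified(url):
    """Return True if the URL has an http/https scheme and a host name."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```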

Q: My site has both "http" and "https" version of URLs. Do I need to list both?

No. Please list only one version of a URL in your Sitemaps. Including multiple versions of URLs may result in incomplete crawling of your site.

Q: URLs on my site have session IDs in them. Do I need to remove them?

Yes. Including session IDs in URLs may result in incomplete and redundant crawling of your site.

Q: Does position of a URL in a Sitemap influence its use?

No. The position of a URL in the Sitemap has no impact on how it is used or regarded by search engines.

Q: Some of the pages on our site use frames. Should we include the frameset URLs or the URLs of the frame contents?

Please include both URLs.

Q: Can I zip my Sitemaps or do they have to be gzipped?

Please use gzip to compress your Sitemaps.

Q: Will the "priority" hint in the XML Sitemap change the ranking of my pages in search results?

No. The "priority" hint in your Sitemap only indicates the importance of a particular URL relative to other URLs on your own site.

Q: Is there an XML schema that I can validate my XML Sitemap against?

An XML schema is available for Sitemap files at http://www.google.com/schemas/sitemap/0.84/sitemap.xsd, and a schema for Sitemap index files is available at http://www.google.com/schemas/sitemap/0.84/siteindex.xsd. You can read more about validating your Sitemap here.



https://www.google.com/webmasters/sitemaps/login

digitalfunstuff
19th June 2005, 08:46 AM
Hi Niko, This looks a little confusing.... Have you tried it yet? Has anyone tried it?
It looks like you need a programming degree to work it all out.
Anyhow, will try to work it out.

Annie

niko
20th June 2005, 05:03 PM
Hi Annie

I haven't had time, but it looks like a great idea. I do believe that any chance to put your site one spot above that infernal competitor who always seems to shove your site down is a great opportunity.

I already send updates manually to Google's addurl, but the advantage of this is that you can specify how often you update and what pages the spider should visit. As an Ezimerchant user, though, the category restriction doesn't help to describe the products for the htm pages, and I do believe an option to change the name of the page to a specific name would help our cause considerably.

I also believe linking to quality sites and having them do the same in return is of much benefit. Sometimes even misspelling words is useful too, because you capture the illiterate or lazy fingers. I also believe it is extremely important that the global keywords are restricted to the main page, with more specific ones as you go down each level. I would advise all ezimerchant members to become a Google member by getting a Gmail account and seeing what this add-on service can do for you. I do believe that Ezimerchant should build a links page with all of the registered ezimerchant members' web sites listed, plus even have a portal based on product type.

An example of a portal is www.octapc.com. You will see someone has taken the web site and just uses it to direct people to paid advertisers. They do this so that web sites that have been let slip but had good traffic will together attract a volume of click-throughs, which means money for the owner of all of these collected web sites.

I am aware the majority of ezimerchant users are not wanting to spend too much time on the fiddly bits of getting their web sites up for that extra spot or two but rather prefer an automated solution to this.

I would love to help everyone that asks but I am struggling to maintain 4 web sites with another one that needs a total build. If only I could copy and paste one web site to all the others with easy modifications (title etc) but nothing is simple.

dferguson
21st June 2005, 12:13 PM
We are going to work on adding the sitemap XML to the generated ezimerchant website, so you won't need to learn to program to benefit.

:-P

digitalfunstuff
22nd June 2005, 05:33 AM
Hi David,

What a wonderful thing!!! Please keep us informed when it will be ready.

Dkiss
4th July 2005, 09:53 AM
Does this mean that the hard coded site maps on my site are out of date?

digitalfunstuff
27th September 2005, 06:01 AM
Hi David,
Any progress?

creeky
27th February 2007, 09:41 AM
thought i would bump this mainly as seemed dead and no more info forthcoming

at the moment i am interested in how far off this feature is, as i have already generated a basic sitemap script for my site but need to contact my hoster every time i want it to run, and paying for a cron task to be set up will only cover every 2 weeks' worth of change

dferguson
27th February 2007, 11:11 AM
We have implemented Google Sitemap in ezimerchant Professional V4.0. However, after reading ALL of Google's own documentation I can hardly see how it will help anyone with an ezimerchant site. The only function of the sitemap is to ensure Google finds all the pages on your website. ezimerchant sites use standard HTML links to each and every page, so there is no reason why Google cannot find every active page.

creeky
5th March 2007, 12:40 AM
Correct, however my experience in starting up in this crazy world of estores and webpages has shown that -

The only way to fast-track the indexing of the site is by creating a sitemap and having it uploaded into your Google webmaster area.

While I only had a few items in my store to begin with, not much more than my first page was being indexed, even after the first couple of months. Bear in mind that they only crawl your home page every 2 weeks on average.

Once I had the xml sitemap created and uploaded, bingo: less than 2 weeks later, along with a massive 50 pages crawled in one day, the whole lot was indexed. Average crawl for me is about 8 pages a day at the moment.

For an ezimerchant customer who wants everyone to go through their front page only, then no, they don't need this feature.

But for those like me that would rather have someone search for the product they are after and have google list the direct page that it is on then this is a good help.

I am attending an SEO training day this month, and will share any other findings or elaborations that I come across on the black arts that Google employs :D