
gen-sitemap: a sitemap generator

Doing better than Google ;-)

gen-sitemap is a simple tool that generates sitemap.xml files as described at sitemaps.org. Search engines use these sitemaps to optimize how they crawl your site.

This program is similar to the Google sitemap generator, but it has a simpler and more intuitive configuration file, and it is probably more powerful.

Features

  • Simple text-based configuration file, similar to robots.txt (Google uses XML)
  • Options (priority, changefreq, ...) can be specified per directory and per file type
  • Powerful but still simple syntax for filtering files
  • Can read wikimedia sitemap files (compressed sitemaps)
  • Can ping Google and Yahoo about new sitemap files
  • Can handle sitemap indexes and multiple files
  • Does not touch sitemaps that have not changed
  • Open source (GPL v2)

But there are also features/bugs:

  • gen-sitemap doesn't split big files by itself (but it warns about them).
  • It doesn't use Apache logs to build the sitemap.

Configuration

The configuration is a UTF-8 text file with space-separated fields. Double quotes are removed. Inside double quotes, spaces don't split fields and quoting backslashes are removed.

Lines starting with # are comments and are ignored by the parser. Options can be given in any order, after the command arguments.
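
The quoting rules above can be sketched in Python as follows. This is only an illustration and not gen-sitemap's actual parser: the function name split_fields is hypothetical, and the sketch assumes that a backslash inside double quotes quotes the character that follows it.

```python
def split_fields(line):
    """Split one configuration line into fields (illustrative sketch)."""
    if line.lstrip().startswith("#"):
        return []                      # comment line: ignored
    fields, current = [], ""
    in_quotes = False
    started = False                    # True once the current field has begun
    chars = iter(line.rstrip("\n"))
    for ch in chars:
        if in_quotes:
            if ch == "\\":
                # assumption: the backslash quotes the next character
                current += next(chars, "")
            elif ch == '"':
                in_quotes = False      # closing quote is dropped
            else:
                current += ch          # spaces are kept inside quotes
        elif ch == '"':
            in_quotes = True           # opening quote is dropped
            started = True
        elif ch.isspace():
            if started:
                fields.append(current)
            current, started = "", False
        else:
            current += ch
            started = True
    if started:
        fields.append(current)
    return fields
```

With this sketch, a quoted value that contains a space stays in a single field, and the quotes themselves do not appear in the result.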

Commands

base-url URL

Set the base URL. All subsequent URLs are relative to this URL. Normally it consists only of the protocol and the host name.

Examples:

base-url http://www.example.net
base-url "http://www.example.net/"

base-path PATH

Set the base path. This directory corresponds to what a browser sees at base-url. All subsequent paths are relative to this path.

Example:

base-path /pub/www.example.net/

sitemap PATH [PATH]

Define the output file for the sitemap. The default is sitemap.xml.gz. If PATH is relative, it is relative to base-path. If there are two PATHs, the first PATH is used as the sitemap index and the second as the pattern for the sitemap files: the * is used as a placeholder for the sitemap number.

PATH should end with .xml for standard uncompressed XML or .xml.gz for a compressed sitemap. The suffix determines what kind of sitemap to generate.
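
As a rough Python illustration of the two rules above (the * placeholder and the suffix-based choice of format), consider the sketch below. The function names and the file name sitemap-*.xml.gz are hypothetical, and how gen-sitemap actually numbers its files (starting value, padding) is not specified here.

```python
import gzip

def sitemap_name(pattern, number):
    """Replace the * placeholder in the pattern with the sitemap number."""
    return pattern.replace("*", str(number), 1)

def open_sitemap(path):
    """Open PATH for writing, compressed or plain according to its suffix."""
    if path.endswith(".xml.gz"):
        return gzip.open(path, "wt", encoding="utf-8")   # compressed sitemap
    if path.endswith(".xml"):
        return open(path, "w", encoding="utf-8")         # plain XML sitemap
    raise ValueError("sitemap path should end with .xml or .xml.gz: " + path)
```

For example, sitemap_name("sitemap-*.xml.gz", 3) gives sitemap-3.xml.gz, which open_sitemap would then write as a gzip-compressed file.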

add-url: PATH [changefreq=FREQ] [priority=PRIORITY] [lastmod=TIME] [class=CLASS]

Add the URL (relative to base-url) to the sitemap. See the Options section for a description of the options.

Example:

add-url project/dynamic.php?blog="My My"&date="tomorrow" changefreq=weekly

filter-reset:

Reset (cancel) the current filters. Useful when using different filters for different parts of the web site. See the Filters section for more information about filter usage.

Example:

filter-reset:

filter-ignore: MASK

Add a new filter to ignore MASK. See the Filters section for more information about filter usage.

Examples: (see FILTER section)

filter-add: MASK [changefreq=FREQ] [priority=PRIORITY] [lastmod=TIME] [index=INDEX] [setlastmod=True|False] [class=CLASS]

Add a new filter, optionally setting options for the files matching this filter. See the Options section for the options and the Filters section for more information about filter usage.

Examples: (see FILTER section)

add-dir: PATH [changefreq=FREQ] [priority=PRIORITY] [lastmod=TIME] [index=INDEX] [setlastmod=True|False] [class=CLASS]

Check all files in the directory PATH (relative to base-path), and recurse into subdirectories. If PATH is a file, parse only PATH. The filters decide what to add and which options to use. Hidden files (starting with ".") are ignored.

Examples:

add-dir . setlastmod=True
add-dir interesting-dir priority=1.0

include: PATH [ignore-filters=True|False] [changefreq=FREQ] [priority=PRIORITY] [lastmod=TIME] [index=INDEX] [setlastmod=True|False] [class=CLASS]

Include and parse a new configuration file in PATH (relative to base-path). The filters set in the included file are discarded on return. If ignore-filters is True, the current filters are not passed to the included file.

Examples:

include: second.sitemap ignore-filters=True

add-list: PATH [ignore-filters=True|False] [changefreq=FREQ] [priority=PRIORITY] [lastmod=TIME] [index=INDEX] [setlastmod=True|False] [class=CLASS]

Parse a list of files. Use the provided defaults, and the filters.

Examples:

add-list: automatic.sitemap priority=0.3 setlastmod=True
add-list: second.sitemap ignore-filters=True

add-sitemap: PATH [base-url=URL] [class=CLASS] [a-changefreq=FREQ] [a-priority=PRIORITY] [a-lastmod=TIME] [url-index=URL]

Include the URLs from a sitemap, replacing the base URL given in the base-url option with the base URL provided by the base-url command. The other options are used as defaults, or to adjust the values provided by the sitemap. The url-index is prepended to the URLs in the sitemap index. This is used to correct the wikimedia sitemap, which doesn't include the base URL in the sitemap index.

Example: Add a wikimedia sitemap

run: "cd wiki ; php maintenance/generateSitemap.php --server=http://cateee.net"
add-sitemap: wiki/sitemap-index-wikidb-wk_.xml url-index=http://localhost/wiki/ a-priority=-0.4

ping: URL

After generating the sitemap file, ping the URL if gen-sitemap was run with the --notify option. Useful to notify search engines. Remember to put your sitemap URL in the ping URL.

Examples:

ping: "http://www.google.com/webmasters/sitemaps/ping?sitemap=http://localhost/sitemap.xml.gz"
ping: "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://localhost/sitemap.xml.gz"

run: PATH

Immediately run the script PATH (relative to base-path), if the gen-sitemap option --allow-run is set. Consider the security implications of enabling this option.

Example: (see add-sitemap)

Options

changefreq=FREQ

The usual frequency of page changes.

The valid values (from sitemaps.org) of FREQ are: always, hourly, daily, weekly, monthly, yearly, never.

priority=PRIORITY

The relative priority of the page (used to order the crawl downloads). This program uses only the digit-dot-digit format.

The range of valid values (from sitemaps.org) is 0.0 to 1.0.

lastmod=TIME

This option indicates the last modification time of the page, using the W3C time format, e.g.: lastmod=2007-12-30|2008-12-31T23:45Z|....
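
The value rules for these three options can be checked with a short Python sketch like the one below. It is an illustration only: the name check_option is hypothetical, gen-sitemap's own validation may differ, and the lastmod pattern checks only the shape of a W3C date-time, not whether the month, day or hour are in range.

```python
import re

VALID_FREQS = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}

# W3C date-time shapes: YYYY, YYYY-MM, YYYY-MM-DD, and YYYY-MM-DD followed by
# Thh:mm[:ss[.s]] plus a time zone designator (Z or +hh:mm / -hh:mm).
W3C_TIME = re.compile(
    r"\d{4}(-\d{2}(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?"
    r")?)?"
)

def check_option(name, value):
    """Return True if VALUE looks valid for the option NAME."""
    if name == "changefreq":
        return value in VALID_FREQS
    if name == "priority":
        # digit-dot-digit format, within the range 0.0 to 1.0
        return re.fullmatch(r"\d\.\d", value) is not None and 0.0 <= float(value) <= 1.0
    if name == "lastmod":
        return W3C_TIME.fullmatch(value) is not None
    return True  # other options are not checked in this sketch
```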

class=CLASS

Make the current URL a member of the sitemap index class CLASS. This option is used to split a huge number of URLs into multiple sitemaps. You should divide the URLs into classes by usage, so that not all sitemaps need to be updated.

index=INDEX

Use the directory alias instead of the index file named INDEX. Leave it unset (or set it to an invalid value, e.g. one that includes a slash) to ignore these aliases. The filters check the INDEX file, but only the directory is written to the sitemap.

Example: Set index=index.php to use, e.g., dir/ instead of dir/index.php in the sitemap.
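
The aliasing described above can be pictured with the following Python sketch. The function name apply_index is hypothetical. Returning an empty relative path for a top-level index file (meaning the base URL itself) is an assumption, not documented behaviour.

```python
def apply_index(path, index):
    """Return the path to write to the sitemap for PATH, given index=INDEX."""
    if not index or "/" in index:
        return path                    # unset or invalid INDEX: no alias
    head, sep, tail = path.rpartition("/")
    if tail == index:
        return head + sep              # dir/index.php -> dir/
    return path                        # not an index file: unchanged
```

An INDEX value such as "/none/" contains a slash, so it disables the alias and the full file path is kept.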

setlastmod=True|False

If setlastmod=True is set, the lastmod is set by reading the data from the directories.

ignore-filters=True|False

If ignore-filters=True, the current command will ignore the current filters.

Filters

Filters are a powerful method to select the files to be included in the sitemap, and to select options using patterns.

There are two types of mask: directory masks, which end with a slash (/), and file masks, which do not end with a slash.

All masks match from the beginning of the string. Directory masks match only the directory part (complete or partial). File masks, instead, must match the complete filename.

There are two special sequences: * and **.

The single star is used as expected: it matches any number of characters within a path component (i.e. any character except the slash /). You can use multiple * in a mask (e.g. */2007-*-blogs/index_*.*).

The double star can also match the slash, so it can span any number of directories.

Examples

*.html		# all files with suffix ".html" in the base-path directory
*/*.html	# all files with suffix ".html" in the first-level subdirectories
**/*.html	# all files with suffix ".html" under the base-path, but not in the base-path directory
**.html		# all files with suffix ".html" under the base-path
# but:
**index.html	# could include old_index.html and dir/old_index.html
**/index.html	# use these two lines to match all index.html files
index.html
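
One way to picture the mask rules is to translate each mask into a regular expression, as in the Python sketch below. This is an illustration under stated assumptions, not gen-sitemap's own code: the function names are hypothetical, and treating a directory mask as a match on the leading part of the path is one reading of "complete or partial".

```python
import re

def mask_to_regex(mask):
    """Translate a filter mask into a regular expression string."""
    out = []
    i = 0
    while i < len(mask):
        if mask.startswith("**", i):
            out.append(".*")           # ** may also match slashes
            i += 2
        elif mask[i] == "*":
            out.append("[^/]*")        # * stays inside one path component
            i += 1
        else:
            out.append(re.escape(mask[i]))
            i += 1
    return "".join(out)

def mask_matches(mask, path):
    """Return True if PATH (relative to base-path) matches MASK."""
    pattern = re.compile(mask_to_regex(mask))
    if mask.endswith("/"):
        # directory mask: only the leading directory part has to match (assumption)
        return pattern.match(path) is not None
    # file mask: the complete filename has to match
    return pattern.fullmatch(path) is not None
```

Under this reading, **index.html also matches dir/old_index.html, while **/index.html does not match a top-level index.html, which is why the examples above use two lines to catch every index.html.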

Order of filters

The default is to match the file. The filters are tested in order, and the result (ignore or add) is given by the first non-directory match (a mask not ending with a slash). On a positive result, the following filters are parsed to fill in the missing options.

So the filters are usually written in blocks: the first block contains the directory masks (from particular to general). Then comes a block with particular files or directories, and at the end the general patterns. Optionally, with filter-reset, we can have separate rules for special parts of the web site.

Examples:

filter-reset:
# directory block
filter-ignore: logs/		# ignore non accessible or non public directories
filter-ignore: tools/		# non public directory
filter-ignore: wiki/		# wikimedia prefers rewrite: let's use the virtual directory.
# file block
filter-ignore: y_key_*.html	# Y! sitemaster verification file
filter-ignore: google*.html	# google sitemaster verification file
filter-add: **.html		# add all html files
filter-add: robots.txt		# and robots.txt
filter-ignore: **		# and ignore all other files
# option block (files already have a match)
filter-add: index.html            priority=1.0
filter-add: important/index.html  priority=1.0
filter-add: */index.html          priority=0.9
filter-add: **/index.html         priority=0.7
filter-add: important/*.html      priority=0.7
filter-add: important/**.html     priority=0.6
filter-add: important/news.html	 changefreq=weekly
filter-add: talks/*/		 index="/none/"  changefreq=never
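
The ordering rule can be sketched in Python roughly as follows. This is one possible reading of the description, not the program's actual logic. The function name resolve and the tuple layout of a filter are hypothetical, and matches stands for any function that implements the mask rules of this section. Two assumptions are made: a matching directory filter-ignore ignores the file at once, and a matching directory filter-add only contributes options.

```python
def resolve(path, filters, matches):
    """Return None if PATH is ignored, otherwise a dict with its options.

    filters is an ordered list of (kind, mask, options) tuples, where kind
    is "add" or "ignore"; matches(mask, path) tells whether a mask matches.
    """
    decided = False                    # True after the first matching file mask
    options = {}
    for kind, mask, opts in filters:
        if not matches(mask, path):
            continue
        if kind == "ignore":
            if not decided:
                return None            # first match is an ignore: drop the file
            continue                   # already added: later ignores have no effect
        if not mask.endswith("/"):
            decided = True             # a file mask gives the positive result
        for name, value in opts.items():
            options.setdefault(name, value)   # earlier filters win
    return options                     # no ignore matched: the file is added
```

With the example configuration above, important/index.html would be added by **.html, skip the final filter-ignore: **, and then take priority=1.0 from the first option line that matches it. The later, more general lines cannot override that value.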

Running gen-sitemap

gen-sitemap is a Python script, so you need a Python interpreter to run it.

It supports the following options:

-q, --quiet
don't print warnings and non-fatal error messages
-v, --verbose
be more verbose
-o FILE, --output=FILE
select the output sitemap. Default is sitemap.xml.gz (or sitemap.xml with --plain)
-c FILE, --conf FILE
configuration file (default: sitemap.conf)
-p, --plain
save the sitemap without compression
-n, --notify
send a notice to search engines (according to the ping commands)

Sources / download

The program is available in the get-sitemap repository. It is distributed under the GNU GPL v2 license.

Contact

For bug reports, comments, improvements, etc. (but NOT spam), you can contact me at cate @ cateee.net.
