gen-sitemap is a simple tool that generates sitemap.xml files as described at sitemaps.org.
Search engines use these sitemaps to optimize the crawl of your site.
This program is similar to the Google sitemap generator, but with a simpler and more intuitive configuration file, and it is probably more powerful.
Features
- Simple text-based configuration file, similar to robots.txt (Google uses XML)
- Options (priority, changefreq, ...) can be specified per directory and per file type
- Powerful but still simple syntax to filter files
- Can read wikimedia sitemap files (compressed sitemaps)
- Can ping Google and Yahoo about new sitemap files
- Can handle sitemap indexes and multiple files
- Doesn't touch sitemaps that have not changed
- Open source (GPL v2)
But there are also some limitations/bugs:
- gen-sitemap doesn't explicitly split big files (but it warns).
- It doesn't use Apache logs to build the sitemap.
Configuration
The configuration is a UTF-8 text file whose fields are separated by spaces. Double quotes are removed. Inside double quotes, spaces do not split fields, and quoting backslashes are removed.
Lines starting with # are comments and are ignored by the parser. Options can be put in any order, after the command arguments.
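The quoting rules above are close to POSIX shell word splitting. A minimal sketch of the field splitting, assuming Python's standard shlex module is an acceptable stand-in for the real parser, could look like this. The helper name split_fields is hypothetical, and shlex also honours single quotes, which the description above does not mention.

```python
import shlex


def split_fields(line):
    """Split one configuration line into fields (hypothetical helper)."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        # Blank line or comment line: nothing to parse.
        return []
    # POSIX mode removes the double quotes, keeps quoted spaces inside a
    # field, and drops the backslash in \" and \\ inside double quotes.
    return shlex.split(stripped, comments=False, posix=True)


if __name__ == "__main__":
    print(split_fields('base-url "http://www.example.net/"'))
    print(split_fields('add-dir . setlastmod=True'))
```

The first field is the command, the following ones are its arguments and options.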
Commands
base-url URL
Set the base URL. All subsequent URLs are relative to this URL. Normally it consists only of the protocol and the host name.
Examples:
base-url http://www.example.net
base-url "http://www.example.net/"
base-path PATH
Set the base PATH. This directory corresponds to what a browser sees at base-url.
All subsequent paths are relative to this path.
Example:
base-path /pub/www.example.net/
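To illustrate how base-path and base-url relate, here is a small sketch that maps a file under base-path to the URL a browser would use. The function name path_to_url is hypothetical, and percent-encoding the path is an assumption; the real program may build URLs differently.

```python
import posixpath
from urllib.parse import quote


def path_to_url(base_url, base_path, file_path):
    """Map a file below base_path to its public URL (illustrative only)."""
    # Path of the file relative to the document root.
    relative = posixpath.relpath(file_path, base_path)
    # Join with exactly one slash and escape unsafe characters.
    return base_url.rstrip("/") + "/" + quote(relative)


if __name__ == "__main__":
    print(path_to_url("http://www.example.net",
                      "/pub/www.example.net/",
                      "/pub/www.example.net/docs/intro.html"))
```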
sitemap PATH [PATH]
Define the output file for the sitemap. The default is sitemap.xml.gz.
If PATH is relative, it is relative to base-path.
If there are two PATHs, the first PATH is used as the sitemap index and the second as the pattern for the sitemap files: the * is used as a placeholder for the sitemap number.
PATH should end with .xml for standard uncompressed XML or with .xml.gz for a compressed sitemap.
The suffix determines what kind of sitemap to generate.
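The suffix rule and the * placeholder can be sketched as follows. Both helper names are hypothetical, and writing the sitemap number as a plain decimal is an assumption; the text above does not say how the number is formatted.

```python
import gzip


def open_sitemap(path):
    """Open an output file, choosing compression from the suffix."""
    if path.endswith(".xml.gz"):
        return gzip.open(path, "wt", encoding="utf-8")
    if path.endswith(".xml"):
        return open(path, "w", encoding="utf-8")
    raise ValueError("sitemap path must end with .xml or .xml.gz: " + path)


def numbered_name(pattern, number):
    """Replace the * placeholder with the sitemap number (assumed decimal)."""
    return pattern.replace("*", str(number), 1)


if __name__ == "__main__":
    print(numbered_name("sitemap-*.xml.gz", 1))
```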
add-url: PATH [changefreq=FREQ] [priority=PRIORITY]
[lastmod=TIME] [class=CLASS]
Add the URL (relative to base-url) to the sitemap.
See the Options section for a description of the options.
Example:
add-url project/dynamic.php?blog="My My"&date="tomorrow" changefreq=weekly
filter-reset:
Reset (cancel) the current filters. Useful when using different filters for different parts of the web site. See the Filters section for more information about filter usage.
Example:
filter-reset:
filter-ignore: MASK
Add a new filter to ignore MASK. See the Filters section for more information about filter usage.
Examples: (see the Filters section)
filter-add: MASK [changefreq=FREQ] [priority=PRIORITY] [lastmod=TIME]
[index=INDEX] [setlastmod=True|False] [class=CLASS]
Add a new filter, optionally setting options for the files matching this filter. See the Options section for the options and the Filters section for more information about filter usage.
Examples: (see the Filters section)
add-dir: PATH [changefreq=FREQ] [priority=PRIORITY] [lastmod=TIME]
[index=INDEX] [setlastmod=True|False] [class=CLASS]
Check all files in the directory PATH (relative to base-path), and recurse into subdirectories. If PATH is a file, parse only PATH.
The filters determine what to add and which options to use.
Hidden files (starting with ".") are ignored.
Examples:
add-dir . setlastmod=True
add-dir interesting-dir priority=1.0
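The directory scan can be pictured with a short sketch like the one below. The name walk_visible is hypothetical, and skipping hidden directories as well as hidden files is an assumption; filters would then be applied to each path that is returned.

```python
import os


def walk_visible(root):
    """Yield file paths relative to root, skipping dot-files (sketch)."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune hidden directories in place so os.walk does not enter them.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if name.startswith("."):
                continue
            relative = os.path.relpath(os.path.join(dirpath, name), root)
            # Use forward slashes so the result can be matched against masks.
            yield relative.replace(os.sep, "/")


if __name__ == "__main__":
    for path in walk_visible("."):
        print(path)
```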
include: PATH [ignore-filters=True|False] [changefreq=FREQ] [priority=PRIORITY]
[lastmod=TIME] [index=INDEX] [setlastmod=True|False] [class=CLASS]
Include and parse a new configuration file at PATH (relative to base-path). The filters set in the included file are discarded on return. If ignore-filters is True, the current filters are not passed to the new file.
Example:
include: second.sitemap ignore-filters=True
add-list: PATH [ignore-filters=True|False] [changefreq=FREQ] [priority=PRIORITY]
[lastmod=TIME] [index=INDEX] [setlastmod=True|False] [class=CLASS]
Parse a list of files, using the provided defaults and the filters.
Examples:
add-list: automatic.sitemap priority=0.3 setlastmod=True
add-list: second.sitemap ignore-filters=True
add-sitemap: PATH [base-url=URL] [class=CLASS] [a-changefreq=FREQ]
[a-priority=PRIORITY] [a-lastmod=TIME] [url-index=URL]
Include the URLs from a sitemap, replacing the base-url provided as an option with the base URL provided by the base-url command.
The other options are used as defaults, or to adjust the values provided by the sitemap.
The url-index is prepended to the URLs in the sitemap index. This is used to correct the wikimedia sitemap, which doesn't include the base URL in the sitemap index.
Example: Add a wikimedia sitemap
run: "cd wiki ; php maintenance/generateSitemap.php --server=http://cateee.net"
add-sitemap: wiki/sitemap-index-wikidb-wk_.xml url-index=http://localhost/wiki/ a-priority=-0.4
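As an illustration of the base URL replacement, the following sketch reads the locations from an uncompressed sitemap and swaps one base URL for another. The function name rebased_urls is hypothetical and this is not the program's actual code; it only shows the idea with the standard library.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def rebased_urls(xml_text, old_base, new_base):
    """Return the <loc> URLs with old_base replaced by new_base."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", NS):
        url = (loc.text or "").strip()
        if url.startswith(old_base):
            url = new_base + url[len(old_base):]
        urls.append(url)
    return urls


if __name__ == "__main__":
    sample = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
              '<url><loc>http://cateee.net/wiki/Main</loc></url></urlset>')
    print(rebased_urls(sample, "http://cateee.net", "http://www.example.net"))
```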
ping: URL
After generating the sitemap file, ping the URL if the gen-sitemap option --notify is set. Useful to notify search engines.
Remember to put your sitemap URL in the ping URL.
Examples:
ping: "http://www.google.com/webmasters/sitemaps/ping?sitemap=http://localhost/sitemap.xml.gz"
ping: "http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=http://localhost/sitemap.xml.gz"
run: PATH
Immediately run the script PATH (relative to base-path), if the gen-sitemap option --allow-run is set.
Consider the security implications of enabling this option.
Example: (see add-sitemap)
Options
changefreq=FREQ
The usual frequency of page changes.
The valid values of FREQ (from sitemaps.org) are: always, hourly, daily, weekly, monthly, yearly, never.
priority=PRIORITY
The relative priority of the page (used to order the crawl downloads). This program uses only the digit-dot-digit format.
The range of valid values (from sitemaps.org) is 0.0 to 1.0.
lastmod=TIME
This option indicates the last modification of the page, using the W3C time format,
e.g.: lastmod=2007-12-30|2008-12-31T23:45Z|...
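The value rules for changefreq, priority and lastmod can be checked with a few lines of Python. This is only a sketch: the function names are hypothetical, and the lastmod pattern is a simplified reading of the W3C date and time note (it checks the shape of the value, not whether the date exists).

```python
import re

VALID_FREQS = {"always", "hourly", "daily", "weekly",
               "monthly", "yearly", "never"}

# One digit, a dot, one digit: the only priority format accepted here.
PRIORITY_RE = re.compile(r"\d\.\d")

# YYYY, YYYY-MM, YYYY-MM-DD, or a full date with time and time zone.
W3C_RE = re.compile(
    r"\d{4}(-\d{2}(-\d{2}"
    r"(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2}))?)?)?")


def valid_changefreq(value):
    return value in VALID_FREQS


def valid_priority(value):
    return bool(PRIORITY_RE.fullmatch(value)) and 0.0 <= float(value) <= 1.0


def valid_lastmod(value):
    return bool(W3C_RE.fullmatch(value))


if __name__ == "__main__":
    print(valid_changefreq("weekly"), valid_priority("0.7"),
          valid_lastmod("2008-12-31T23:45Z"))
```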
class=CLASS
Set the current URL as a member of the sitemap index CLASS. This option is used to split huge quantities of URLs into multiple sitemaps. You should divide the URLs into classes by usage, so that not all sitemaps have to be updated.
index=INDEX
Use the directory alias instead of the index file specified with INDEX. Don't set it (or set it to an invalid value, e.g. one that includes a slash) to ignore these aliases. The filters will check the INDEX file, but only the directory will be written to the sitemap.
Example: Set index=index.php to use e.g. dir/ instead of dir/index.php in the sitemap.
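The alias rule can be sketched in a few lines. The function name apply_index_alias is hypothetical, and returning an empty relative path for an index file in the top directory is an assumption.

```python
def apply_index_alias(relative_path, index):
    """Replace dir/INDEX with dir/ when a usable INDEX is configured."""
    if not index or "/" in index:
        # Unset or invalid INDEX: aliases are disabled.
        return relative_path
    directory, _, filename = relative_path.rpartition("/")
    if filename != index:
        return relative_path
    # Keep only the directory part, with its trailing slash.
    return directory + "/" if directory else ""


if __name__ == "__main__":
    print(apply_index_alias("dir/index.php", "index.php"))
```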
setlastmod=True|False
If setlastmod=True is set, the lastmod value is set by reading the date from the directories.
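One plausible way to obtain such a date is the file modification time. The sketch below assumes that this is what is meant and formats it as a W3C timestamp in UTC; the function name lastmod_from_file is hypothetical.

```python
import datetime
import os


def lastmod_from_file(path):
    """Return the file modification time as a W3C timestamp in UTC."""
    mtime = os.path.getmtime(path)
    moment = datetime.datetime.fromtimestamp(mtime, datetime.timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


if __name__ == "__main__":
    print(lastmod_from_file(__file__))
```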
ignore-filters=True|False
If ignore-filters=True, the current command will ignore the current filters.
Filters
The filters are a powerful method to select the files to be included in the sitemap, and to select options using patterns.
There are two types of mask: the directory masks, which end with a slash /, and the file masks (not ending with a slash).
All masks match from the beginning of the string. The directory masks match only the directory part (complete or partial), whereas the other masks must match the complete filename.
There are two special characters: * and **.
The single star is used as expected: it matches any number of characters within a path component (i.e. any character except the slash /). You can use multiple * in a mask (e.g. */2007-*-blogs/index_*.*).
The double star can also match the slash, so it can include any number of directories.
Examples
*.html         # all files with suffix ".html" on base-path directory
*/*.html       # all files with suffix ".html" on the first level sub-directories
**/*.html      # all files with suffix ".html" under the base-path, but not in the base-path directory
**.html        # all files with suffix ".html" under the base-path
# but:
**index.html   # could include old_index.html and dir/old_index.html
**/index.html  # use these two lines to match all index.html files
index.html
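The mask rules can be modelled by translating each mask into a regular expression. This is a sketch under the assumptions stated in the comments, not the program's own matcher; mask_to_regex and mask_matches are hypothetical names.

```python
import re


def mask_to_regex(mask):
    """Translate a filter mask into a regular expression string."""
    pieces = []
    for part in re.split(r"(\*\*|\*)", mask):
        if part == "**":
            pieces.append(".*")      # may cross directory separators
        elif part == "*":
            pieces.append("[^/]*")   # stays inside one path component
        else:
            pieces.append(re.escape(part))
    return "".join(pieces)


def mask_matches(mask, path):
    """Check a path (relative to base-path) against one mask."""
    regex = mask_to_regex(mask)
    if mask.endswith("/"):
        # Directory mask: match the beginning of the directory part only.
        directory, separator, _ = path.rpartition("/")
        return re.match(regex, directory + separator) is not None
    # File mask: the complete filename has to match.
    return re.fullmatch(regex, path) is not None


if __name__ == "__main__":
    print(mask_matches("**/index.html", "a/b/index.html"))
    print(mask_matches("**/index.html", "index.html"))
```

The second call shows why the examples use two lines to match every index.html: the mask with the slash cannot match a file in the base-path directory itself.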
Order of filters
The default is to match the file. The filters are tested in order and the result (ignore or add) is given by the first non-directory match (mask not ending with a slash). On a positive result the following filters are parsed to fill in the missing options.
So usually the filters are written in blocks: the first block contains the directory masks (from particular to general). Then comes a block with particular files or directories, and at the end the general patterns.
Optionally, with filter-reset we can have rules for special parts of the web site.
Examples:
filter-reset:
# directory block
filter-ignore: logs/          # ignore non accessible or non public directories
filter-ignore: tools/         # non public directory
filter-ignore: wiki/          # wikimedia prefer rewrite: let use virtual directory.
# file block
filter-ignore: y_key_*.html   # Y! sitemaster verification file
filter-ignore: google*.html   # google sitemaster verification file
filter-add: **.html           # add all html files
filter-add: robots.txt        # and robots.txt
filter-ignore: **             # and ignore all other files
# option block (files have already a match)
filter-add: index.html priority=1.0
filter-add: important/index.html priority=1.0
filter-add: */index.html priority=0.9
filter-add: **/index.html priority=0.7
filter-add: important/*.html priority=0.7
filter-add: important/**.html priority=0.6
filter-add: important/news.html changefreq=weekly
filter-add: talks/*/ index="/none/" changefreq=never
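The ordering rule for file masks can be sketched as below. This is a simplified model, not the program's code: directory masks are skipped entirely, a file that matches no filter is assumed to be added without options, and the names file_mask_matches and evaluate are hypothetical.

```python
import re


def file_mask_matches(mask, path):
    """Full match of a file mask: * stays in a component, ** crosses /."""
    regex = "".join(
        ".*" if part == "**" else "[^/]*" if part == "*" else re.escape(part)
        for part in re.split(r"(\*\*|\*)", mask))
    return re.fullmatch(regex, path) is not None


def evaluate(filters, path):
    """Return None if path is ignored, else the options collected for it.

    filters is a list of (action, mask, options) tuples in file order,
    where action is "add" or "ignore".
    """
    result = None
    for action, mask, options in filters:
        if mask.endswith("/") or not file_mask_matches(mask, path):
            continue  # directory masks are outside this sketch
        if result is None:
            # The first matching file mask decides between add and ignore.
            if action == "ignore":
                return None
            result = dict(options)
        elif action == "add":
            # Later matches only fill in options that are still missing.
            for key, value in options.items():
                result.setdefault(key, value)
    return {} if result is None else result


if __name__ == "__main__":
    rules = [("add", "**.html", {}),
             ("ignore", "**", {}),
             ("add", "index.html", {"priority": "1.0"}),
             ("add", "*/index.html", {"priority": "0.9"})]
    print(evaluate(rules, "index.html"))
    print(evaluate(rules, "style.css"))
```

With these rules the catch-all ignore no longer affects index.html, because an earlier filter-add already matched it; the option block only supplies its priority.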
Running gen-sitemap
gen-sitemap is a Python script, so you need a Python interpreter to run it.
It supports the following options:
- -q, --quiet: don't print warnings and non-fatal error messages
- -v, --verbose: be more verbose
- -o FILE, --output=FILE: select the output sitemap. Default is sitemap.xml.gz (or sitemap.xml with --plain)
- -c FILE, --conf FILE: configuration file (default: sitemap.conf)
- -p, --plain: save the sitemap without compression
- -n, --notify: send a notice to the search engines (according to the ping commands)
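For readers who want to see how such a command line fits together, here is a sketch of an equivalent option parser written with argparse. It mirrors the options listed above but is not taken from the gen-sitemap source; build_parser and parse are hypothetical names.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="gen-sitemap")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="don't print warnings and non-fatal errors")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="be more verbose")
    parser.add_argument("-o", "--output", metavar="FILE", default=None,
                        help="output sitemap")
    parser.add_argument("-c", "--conf", metavar="FILE",
                        default="sitemap.conf", help="configuration file")
    parser.add_argument("-p", "--plain", action="store_true",
                        help="save the sitemap without compression")
    parser.add_argument("-n", "--notify", action="store_true",
                        help="ping the search engines")
    return parser


def parse(argv):
    args = build_parser().parse_args(argv)
    if args.output is None:
        # The default output name depends on the --plain switch.
        args.output = "sitemap.xml" if args.plain else "sitemap.xml.gz"
    return args


if __name__ == "__main__":
    print(parse(["-p", "-c", "site.conf"]))
```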
Sources / download
The program is available in the get-sitemap repository. It is distributed under the GNU GPL v2 license.
Contact
For bug reports, comments, improvements, etc. (but NOT spam), you can contact me at cate @ cateee.net.