sitemap.pl Script : Documentation  
Updated: 11-Mar-2010

Browse sitemap.pl Script Documentation
  1. Installation
  2. Log File
  3. Configuration File
  4. Sitemap File
  5. Parameters
 
 

Note:   "cgi-bin" directory
The instructions in this documentation assume that your web server stores scripts in a directory named cgi-bin. Some web servers use a different name for this directory, such as cgi-local, cgi, or mainwebsite_cgi. In that case, wherever you see cgi-bin in an instruction, substitute the directory name that your web server uses.



» 1. Installation

Follow these steps to install the sitemap.pl script. It may be easier to print out this documentation, or at least the first two or three pages, and check off each step as you complete it.

  1. Download the sitemap.pl .zip file to your computer.

  2. Unzip the .zip file. You should end up with a directory sitemap-YYMMDD (such as sitemap-080723).

  3. Open your FTP program and connect to your web server.

  4. Go into your cgi-bin directory and create a subdirectory named sitemap.

  5. Go into the newly created subdirectory sitemap.

  6. On your local computer, go into the sitemap-YYMMDD directory.

  7. Upload sitemap.pl to your web server. Note: Upload the file using ASCII transfer mode. If necessary, change the first line of sitemap.pl to the path of perl on your web server (default is: /usr/local/bin/perl).

  8. Select the sitemap.pl file on your web server and do CHMOD 755 (User: read/write/execute; Group: read/execute; Other: read/execute) so that sitemap.pl can be run.

  9. On your web server, go to your root directory (the directory where your home page file is located).

  10. On your local computer, open the htaccess-rules.txt file that came in the .zip file. Select and copy all those statements. Then edit the .htaccess file located on your web server. Paste these sitemap-related statements at the start of your .htaccess file and save it to your web server.

    Note: If your web server does not have a .htaccess file, you can upload htaccess-rules.txt and then rename it to .htaccess (starts with a . dot and does not end with any .txt ending).

    Note: If you are using WordPress or any other publishing system that has virtual files, put the sitemap-related statements before the statements of your publishing system. For example, put the sitemap-related statements before the WordPress-related statements. In general, if you already have a statement RewriteRule ^.*$ (i.e., one that matches everything), keep it at the end of your .htaccess file.

  11. To test that sitemap.pl is installed successfully, open your web browser and access DOMAIN.com/sitemap.txt (substitute in your own domain name in this URL) and you should see a list of the files on your website.

    500 Internal Server Error
    If you see a "500 Internal Server Error" then follow these steps:

    1. Redownload the .zip file to your Windows PC and unzip it again.

    2. Reupload the sitemap.pl file using ASCII transfer mode (not binary mode!).

    3. Select sitemap.pl on the web server and do CHMOD 755 (User: read/write/execute; Group: read/execute; Other: read/execute)

    4. If necessary, change the first line of sitemap.pl so it has the correct path to perl on your web server (default is /usr/local/bin/perl; try: /usr/bin/perl). If you're not sure, look in any other .pl file on your web server that works, or ask your hosting company what is the path to perl.

    5. Test the installation by accessing DOMAIN.com/sitemap.txt again. Press your web browser's Refresh/Reload button.

  12. If you see any files in the sitemap list that you would prefer not to be listed, create a configuration file as described in the "Configuration File" section below.

  13. In your web server's root directory, edit your robots.txt file and add the following line at the end (substitute in your own domain name):

    sitemap: http://DOMAIN.com/sitemap.xml

    Note: If you do not have a robots.txt file, run the Windows Notepad editor and create a robots.txt with the above line (substitute in your own domain name) and upload it to your web server's root directory.

  14. (optional) For instructions from Google.com on how to submit your Google Sitemap to Google.com, see: "How do I submit a Sitemap?" If you have set up your robots.txt file as indicated above, you do not need to submit your sitemap to Google. However, submitting it gives you access to some useful GoogleBot crawler status information from Google.com, so we recommend that you do submit your Google Sitemap to Google.com.

  15. sitemap.pl is distributed as shareware. If you find sitemap.pl to be useful and continue to use it, please purchase a license; it's a one-time fee of only $9.95.
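
The three causes of a "500 Internal Server Error" listed in step 11 (wrong transfer mode, missing CHMOD 755, missing perl path line) can also be checked with a short script wherever you have shell access to the file. The following Python sketch is not part of sitemap.pl; the helper name check_script and its messages are hypothetical, and it assumes a Unix-like file system.

```python
import os
import stat

def check_script(path):
    """Return a list of likely problems with a CGI script file (hypothetical helper)."""
    problems = []
    with open(path, "rb") as handle:
        data = handle.read()
    first_line = data.split(b"\n", 1)[0]
    # The first line must name the perl interpreter, e.g. #!/usr/local/bin/perl
    if not first_line.startswith(b"#!"):
        problems.append("first line is not a #! interpreter line")
    # Carriage returns usually mean the file was not transferred in ASCII mode
    if b"\r" in data:
        problems.append("file contains carriage returns (CR)")
    # CHMOD 755 = read/write/execute for User, read/execute for Group and Other
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o755:
        problems.append("permissions are %o, expected 755" % mode)
    return problems
```

An empty list means none of these three problems was found; it does not confirm that the perl path on the first line is the correct one for your web server.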


» 2. Log File

Note: sitemap.pl v10.03.09-beta or later is required if you want a log file. Earlier versions do not generate logs, so upgrade if necessary.

When you run sitemap.pl, it creates a log file sitemap-log.txt (located in the same directory where sitemap.pl is located).

Download or view that log file using your FTP program and you will see why sitemap.pl is including or excluding each of your files.


» 3. Configuration File

An optional configuration file sitemap-skip.txt (located in the same directory where sitemap.pl is located) can be used to tell sitemap.pl what files to exclude from the sitemap.

The sitemap.pl script has a built-in list of common exclusions (e.g., skip all .gif files). Thus you need to create a sitemap-skip.txt file only if sitemap.pl is listing files that you do not want in the sitemap.

The built-in exclusions are:

#-- specific files --

/postinfo.html


#-- patterns --

/core.?
/google?.html
*/HEADER.html
*/README.html
*_


#-- directories --

/cgi-bin/
/log/
/logs/
/private/
/webalizer/


#-- extensions --

.asa
.bak .bat .bmp
.css .csv
.db .dll
.exe
.gif .gz
.ico .ini
.jpeg .jpg .js
.mdb .mid .mp3 .mpeg .mpg .msi
.pdf .pl .png .psp
.rar .rm
.sql .swf
.tar .temp .tgz .tif .tiff .tmp .txt
.url
.wav .wma .wmv
.xbm .xls .xml .xsd .xsl
.zip

Tip:   View Built-in Exclusions List
You can view the built-in exclusions list by accessing:

DOMAIN.com/cgi-bin/sitemap/sitemap.pl?skip


Comments and Whitespace: (spaces, tabs, #)

Comments are indicated with # and are discarded. Comments can be on lines by themselves or on the same line as a statement. Everything from the # to the end of the line is discarded.

You can freely use spaces and/or tabs throughout. Leading/trailing spaces and/or tabs are discarded; thus you can indent statements/comments. Blank lines are ignored. Multiple consecutive spaces and/or tabs (whitespace) are treated as a single space; thus you can use any spacing/tabbing you prefer.
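
The comment and whitespace rules above can be illustrated with a small Python sketch. This is not the code sitemap.pl uses; the function name parse_skip_lines is hypothetical, and the sketch assumes that every whitespace-separated item on a line is a separate entry.

```python
def parse_skip_lines(text):
    """Split sitemap-skip.txt content into entries (illustrative sketch)."""
    entries = []
    for line in text.splitlines():
        # Everything from # to the end of the line is a comment
        line = line.split("#", 1)[0]
        # split() with no arguments trims both ends and treats any run
        # of spaces/tabs as one separator; blank lines yield nothing
        entries.extend(line.split())
    return entries

sample = """
# my exclusions
.doc   .ppt\t.psd   # office and image files
    /projects/
"""
print(parse_skip_lines(sample))  # ['.doc', '.ppt', '.psd', '/projects/']
```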


Filename Ending: (.ending)

To discard all files with a particular filename ending (e.g.: all .gif files), type .gif into your sitemap-skip.txt file. Note: You do not have to type in any filename ending that is part of the built-in exclusions list; those are already taken care of for you. Most of the common filename endings are already in there.

You can put more than one filename ending on the same line; simply separate each one by a tab or a space. To keep the file organized, it's recommended that you keep your filename endings sorted alphabetically (but you don't have to).


Directory: (/directory/ and /directory)

To skip an entire directory, type that directory into your sitemap-skip.txt file.

For example, /projects/ would cause the entire /projects/ directory to be excluded.

Likewise, /projects (no trailing /) would cause the entire /projects/ directory to be excluded.

For example, /data/private/ would cause the entire /data/private/ sub-directory to be excluded. The /data/ directory itself would still be included.
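
A minimal Python sketch of the directory rule, assuming it amounts to a prefix test on the URL path. The function name is_in_directory is hypothetical and the real script may implement the check differently.

```python
def is_in_directory(url_path, directory):
    """True if url_path lies inside the excluded directory (sketch)."""
    # /projects and /projects/ are treated the same
    prefix = directory.rstrip("/") + "/"
    return url_path.startswith(prefix)

print(is_in_directory("/projects/plan.html", "/projects"))       # True
print(is_in_directory("/data/index.html", "/data/private/"))     # False
print(is_in_directory("/data/private/x.html", "/data/private/")) # True
```

Appending the trailing / before comparing keeps a similarly named directory such as /projects-old/ from being excluded by accident.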


Patterns: (* and ?)

You can use * and ? to specify pattern matches.

The * matches zero or more occurrences of any character, including: A .. Z, a .. z, 0 .. 9, and all symbols including / (the directory character).

The ? is similar to the * match pattern character; however, ? does not match the / character. This is an important distinction since ? enables pattern matches such as /google?.html which would match /google-adsense.html but not /google/adsense.html (because ? does not match the / character).

For example, */HEADER.html would cause HEADER.html located in any directory (including the root directory) to be excluded.

For example, */private/ would cause a directory named private located in any directory to be excluded.
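
One way to picture the * and ? rules is as a translation into a regular expression. The Python sketch below is an assumption-laden illustration, not the expression sitemap.pl builds (use the skipre parameter to see the real one). It assumes that ? matches zero or more characters other than /, that patterns are anchored at the start of the URL path, and that a pattern ending in / matches everything inside that directory. The name wildcard_to_regex is hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a skip pattern with * and ? into a compiled regex (sketch)."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")      # any characters, including /
        elif char == "?":
            parts.append("[^/]*")   # any characters except /
        else:
            parts.append(re.escape(char))
    # Assumption: a directory pattern (trailing /) matches by prefix
    end = "" if pattern.endswith("/") else "$"
    return re.compile("^" + "".join(parts) + end)

google = wildcard_to_regex("/google?.html")
print(bool(google.match("/google-adsense.html")))  # True
print(bool(google.match("/google/adsense.html")))  # False
print(bool(wildcard_to_regex("*/HEADER.html").match("/HEADER.html")))  # True
```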


File: (/directory/file.ending)

You can exclude a particular file by simply stating it.

For example, to exclude /google.html, simply type /google.html into sitemap-skip.txt

For example, to exclude /google/adsense.html, simply type /google/adsense.html into sitemap-skip.txt

For example, /*/HEADER.html would cause HEADER.html located in any sub-directory (but not the root directory) to be excluded.


Ignore a Built-in Exclusion

To override a built-in exclusion, type the exclusion into your sitemap-skip.txt file and prefix it with a plus sign (+).

For example, if you want to override the exclusion of .pdf files and thus have .pdf files appear in the sitemap, then add the following line to your sitemap-skip.txt file:

+.pdf

Note: There must not be any spaces after the +
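
The effect of the + prefix can be sketched as set arithmetic on the exclusion list. The Python below is only an illustration under that assumption; apply_overrides is a hypothetical name and sitemap.pl may store its exclusions differently.

```python
def apply_overrides(built_in, user_entries):
    """Combine built-in exclusions with sitemap-skip.txt entries (sketch)."""
    active = set(built_in)
    for entry in user_entries:
        if entry.startswith("+"):
            # +.pdf removes .pdf from the exclusions, so .pdf files are listed
            active.discard(entry[1:])
        else:
            active.add(entry)
    return active

print(sorted(apply_overrides({".gif", ".pdf"}, ["+.pdf", ".doc"])))  # ['.doc', '.gif']
```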



» 4. Sitemap File

The XML sitemap file that sitemap.pl creates complies with Sitemap Protocol 0.9 as defined by sitemaps.org.

sitemap.pl automatically includes the optional <lastmod>, <changefreq>, and <priority> tags as part of each <url>. Currently, there is no configuration option to turn off generation of these tags; they are always generated.

The value of the <lastmod> tag is the last-modified timestamp of the URL (directory or file).
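
As an illustration of how a last-modified timestamp becomes a <lastmod> value, here is a short Python sketch. It assumes the date-only form (YYYY-MM-DD), which the Sitemap protocol permits, and UTC; this documentation does not state which form sitemap.pl emits. The name lastmod_for is hypothetical.

```python
import os
from datetime import datetime, timezone

def lastmod_for(path):
    """Format a file's or directory's modification time for <lastmod> (sketch)."""
    mtime = os.path.getmtime(path)  # seconds since the epoch
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%d")
```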

The value of the <priority> tag is based on the URL:

<priority> Tag
Value   URL
1.0     Home page
0.8     Directory (at any depth)
0.6     File

Thus the order of priority is: home page, directories, files.

Note: There are no configuration variables to change these values.
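
The priority rule can be written as a three-way test. In this Python sketch, the home page is assumed to be the URL path / and a directory is assumed to be any other path ending in /; both are assumptions about how URLs are represented, and priority_for is a hypothetical name.

```python
def priority_for(url_path):
    """Return the <priority> value for a URL path (illustrative sketch)."""
    if url_path == "/":
        return "1.0"   # home page
    if url_path.endswith("/"):
        return "0.8"   # directory, at any depth
    return "0.6"       # file

print(priority_for("/"), priority_for("/docs/api/"), priority_for("/docs/faq.html"))
# 1.0 0.8 0.6
```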

The value of the <changefreq> tag is based on the URL's last-modified timestamp, except for the home page which is always daily.

<changefreq> Tag
Value     Condition
daily     Home page
weekly    Modified within 2 months
monthly   Modified within 6 months
yearly    Modified within 3 years
never     Older than 3 years

Note: There are no configuration variables to change these values or conditions.
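
The <changefreq> conditions can be read as a chain of age checks, evaluated from newest to oldest. The Python sketch below assumes a month is 30 days, a year is 365 days, and "within" includes the boundary; the script's exact cut-offs are not documented here. The name changefreq_for is hypothetical.

```python
from datetime import datetime, timedelta

def changefreq_for(last_modified, now, is_home_page=False):
    """Return the <changefreq> value for a URL (illustrative sketch)."""
    if is_home_page:
        return "daily"
    age = now - last_modified
    if age <= timedelta(days=2 * 30):    # assumed: 2 months = 60 days
        return "weekly"
    if age <= timedelta(days=6 * 30):    # assumed: 6 months = 180 days
        return "monthly"
    if age <= timedelta(days=3 * 365):   # assumed: 3 years = 1095 days
        return "yearly"
    return "never"

now = datetime(2010, 3, 11)
print(changefreq_for(datetime(2010, 2, 1), now))  # weekly
print(changefreq_for(datetime(2005, 1, 1), now))  # never
```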



» 5. Parameters

To view the version number of sitemap.pl, access sitemap.pl with the version parameter:

DOMAIN.com/cgi-bin/sitemap/sitemap.pl?version

To view the built-in exclusion list, access sitemap.pl with the skip parameter:

DOMAIN.com/cgi-bin/sitemap/sitemap.pl?skip

Advanced users: To view the regular expression built from wildcard exclusions (* and ?), access sitemap.pl with the skipre parameter:

DOMAIN.com/cgi-bin/sitemap/sitemap.pl?skipre




E.&O.E.; © Cusimano.Com Corporation; www.c3scripts.com