The sitemap standard allows web authors to help search engines such as Google index their sites. It describes which pages have changed most recently and which pages should be given the highest priority by the search engines when looking for updates.
Sitemap files do not directly influence page rank. What they are likely to do is increase the likelihood that all the pages you want visited and indexed will be visited and indexed. If you have a large site, the search engine bots may not take the time to crawl your entire site on every visit. By telling them which pages have changed and which are the most important to check, you give every changed page in your site a much better chance of being crawled.
A sitemap is an XML file whose format is described at http://www.sitemap.org.
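For orientation, here is a minimal sketch in Perl of what a single entry in such a file looks like. The helper names xml_escape and url_entry are hypothetical (they do not come from any module), and the sample URL and date are made up. The element names are the ones the script on this page writes. The escaping step is included because a URL containing a character such as & would otherwise not be valid XML.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: escape the five characters that are special in XML.
sub xml_escape {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
                  '"' => '&quot;', "'" => '&apos;');
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: build one <url> element from its four fields.
sub url_entry {
    my ($loc, $lastmod, $freq, $priority) = @_;
    return "<url>\n"
         . "<loc>" . xml_escape($loc) . "</loc>\n"
         . "<lastmod>$lastmod</lastmod>\n"
         . "<changefreq>$freq</changefreq>\n"
         . "<priority>$priority</priority>\n"
         . "</url>\n";
}

# Made-up sample values, for illustration only.
print url_entry("http://www.example.com/page.html?a=1&b=2",
                "2005-06-01", "weekly", "0.5");
```

A complete file wraps any number of these entries in a single urlset element.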
Google offers a tool if your site supports Python. My site does not, so I have written a Perl script to create the file automatically.
This script is based upon a similar script by Tony Lawrence. I made changes for compatibility with our web server and to address issues that were not addressed in the original script.
This script includes every *.htm and *.html file in the site that is not in a directory named "tmp" or the directory "/test". (See the "if" statement that excludes these directories.) Edit that statement to exclude any directories and files that you do not want indexed or that have access restrictions.
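Before editing the exclusion test, it may help to try it on a few paths in isolation. The sketch below wraps the same two patterns used in the script's "if" statement in a hypothetical helper named is_excluded. The sample paths are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if a site-relative path should be left out.
# It applies the same two patterns as the script: "tmp" anywhere in the
# path, or "test" at the very start of the path.
sub is_excluded {
    my ($path) = @_;
    return ($path =~ /tmp/ || $path =~ /^test/) ? 1 : 0;
}

# Made-up sample paths, relative to the top of the site.
for my $path ("index.html", "tmp/draft.html", "physics/tmp/old.htm",
              "test/page.html", "notes/test/page.html", "testing.html") {
    printf "%-24s %s\n", $path, is_excluded($path) ? "excluded" : "included";
}
```

Both patterns match substrings rather than whole directory names. As a result, notes/test/page.html is kept (only a leading "test" is matched), while a top-level testing.html is dropped. If that is too loose for your site, a stricter pattern such as /^test\// may be closer to what you want.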
Sitemapper
#!/usr/bin/perl
my $sitepath="/users/faculty/phy/matthews/www-home";
my $website="http://www.wfu.edu/~matthews";

chdir($sitepath);
@stuff=`find . -type f -name "*.htm*" ! -name "*~"`; #Find htm, html, but not htm~ or html~.
open(O,">sitemap");
print O <<EOF;
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
EOF

$slash="\/"; #An escaped slash character.
foreach (@stuff) {
  chomp;
  $badone=$_;
  $badone =~ tr/-_.\/a-zA-Z0-9//cd;
  print if ($badone ne $_); #Print files with funky names.
  s/^..//;
  $rfile="$sitepath/$_";

  $DirEndIndex=rindex($rfile,$slash)+1; #What is the index of the last slash?
  $dir=substr $rfile,0,$DirEndIndex; #The directory name of the current file

  if ( ! /tmp/ && ! /^test/) {
    #The above excludes the two directories I don't want indexed.
    #The "^" character flags "test" as a top level directory.
    ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks)=stat $rfile;
    ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst)=localtime($mtime);
    $year +=1900;
    $mon++;
    $mod=sprintf("%0.4d-%0.2d-%0.2dT%0.2d:%0.2d:%0.2d+00:00",$year,$mon,$mday,$hour,$min,$sec);
    $mod=sprintf("%0.4d-%0.2d-%0.2d",$year,$mon,$mday);
    $freq="weekly"; #How often is the page likely to change?
    $freq="daily" if /^index.html/; #Include such a statement for files that change daily.

    $priority="0.5"; #The default priority is 0.5.
    #List pages below that deserve higher or lower priority.
    $priority="0.7" if /index.html/; #Any "index.html" page is medium priority.
    $priority="0.7" if /miscellaneous.html/;
    $priority="0.7" if /switches.html/;
    $priority="0.7" if /teaching.html/;
    $priority="0.7" if /RayTracing.html/;
    $priority="0.7" if /JpgVsGif.html/;
    $priority="0.7" if /dipole.html/;
    $priority="0.9" if /^index.html/; #Main page for site is top priority.

    print O <<EOF;
<url>
<loc>$website/$_</loc>
<lastmod>$mod</lastmod>
<changefreq>$freq</changefreq>
<priority>$priority</priority>
</url>
EOF
  }
}

print O <<EOF;
</urlset>
EOF
close O;
unlink("sitemap.gz");
system("gzip sitemap");
system("chmod ugo+r sitemap.gz");
The script should be edited to include your site's server path and URL.
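One detail worth knowing if you want full timestamps rather than dates: the script builds its date from localtime, so the "+00:00" suffix in the first (overwritten) sprintf line would only be accurate on a server whose clock is set to UTC. The following is a minimal sketch of an alternative, assuming you are happy to report times in UTC. It uses gmtime with strftime from the standard POSIX module. The helper names lastmod_date and lastmod_datetime are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: date-only form, YYYY-MM-DD, taken in UTC.
sub lastmod_date {
    my ($mtime) = @_;
    return strftime("%Y-%m-%d", gmtime($mtime));
}

# Hypothetical helper: full timestamp. Because gmtime is used,
# the +00:00 offset is correct whatever the server's time zone.
sub lastmod_datetime {
    my ($mtime) = @_;
    return strftime("%Y-%m-%dT%H:%M:%S+00:00", gmtime($mtime));
}

# Use this script itself as a sample file; fall back to the
# current time if stat fails for any reason.
my @info  = stat $0;
my $mtime = @info ? $info[9] : time;   # element 9 of stat's list is mtime
print lastmod_date($mtime), "\n";
print lastmod_datetime($mtime), "\n";
```

Either helper could replace the two sprintf lines in the script. The date-only form matches what the script currently writes, except that the day is taken in UTC rather than local time.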
Running this script from the Unix or Linux command line will create the sitemap file. It needs to be run regularly to keep your sitemap file up to date. I suggest setting up a crontab entry to run it daily. To do this, at the Unix or Linux command prompt, type
crontab -e
That will bring up an editor window. If you enter the following:
0 0 * * * fullpathname/sitemapper
save, and exit, then the sitemapper program will be run every night at midnight.
Next, you need to tell the search engines where to find your sitemap.gz file. As far as I know, only Google is using sitemaps at present. Google provides instructions for submitting the location of your sitemap.
If you cannot run the script on a regular schedule, then consider modifying it to run as a CGI script off a page you view regularly. If the script is not run, the sitemap file will quickly become out of date.
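As a rough sketch of the CGI approach, a small wrapper can run the sitemapper and report what happened. This is an assumption about how one might do it, not something I have deployed. The path /fullpathname/sitemapper is a placeholder, and status_message is a hypothetical helper.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed location of the sitemapper script; replace with your own path.
my $sitemapper = "/fullpathname/sitemapper";

# Hypothetical helper: turn the return value of system() into a short message.
sub status_message {
    my ($status) = @_;
    return "could not start sitemapper" if $status == -1;
    return "sitemap rebuilt"            if $status == 0;
    return "sitemapper killed by signal " . ($status & 127) if $status & 127;
    return "sitemapper exited with code " . ($status >> 8);
}

# A CGI response is a header, a blank line, then the body.
# Unbuffer output so the header is sent before the child process runs.
$| = 1;
print "Content-Type: text/plain\n\n";
print status_message(system($sitemapper)), "\n";
```

Keep in mind that anyone who can load the wrapper's URL can trigger a rebuild, so you may want to place it somewhere that is not publicly linked.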
I am just learning about the sitemap protocol. I offer no guarantees that this script will help and not hurt your Google rankings. I request any feedback and criticism of the above script.