A Spider Detector
I stumbled onto a way to be notified when a search engine
spider visits a web page.
We sell our Master Series CGI software via PayPal.
To customize the buying experience and to record any
affiliate commissions owed, we use proprietary software
as an interface.
With that software, whenever anyone clicks on a "buy" image
link with an incorrect product ID, we get notified by email.
It's a feature to help us keep broken links to a minimum.
Well, we had such a link. And I noticed that both the Yahoo!
and Google spiders followed it.
(I don't know whether or not MSN's search spider would
have followed the link, as the link may have been fixed
before their spider was at that particular page.)
So there it is.
It might not spot every spider, but it will report any that
follow a link on your web page created according to this
article.
The link on your web page launches a script. The script
sends you an email with the particulars whenever the link
is clicked.
After the email is sent, a "307 Temporary Redirect" status
code is delivered to the spider (or browser, if a human
clicks on the link). The "307" tells the spider, in
essence, that the redirect is only temporary and that it
should keep requesting the original link URL the next time
it comes around. (See
http://w3.org/Protocols/rfc2616/rfc2616-sec10.html
for specifications.)
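To make the redirect step concrete, here is a minimal sketch of a CGI-style 307 response written in Python. The language, the function name redirect_response, and the example.com destination are assumptions chosen for illustration; they do not come from the script this article describes.

```python
def redirect_response(location):
    """Return the CGI header block for a 307 Temporary Redirect.

    `location` is the URL the spider or browser should be sent to.
    """
    # A CGI script sets the HTTP status with a "Status:" header.
    # The blank line after the headers ends the header block.
    return (
        "Status: 307 Temporary Redirect\r\n"
        f"Location: {location}\r\n"
        "\r\n"
    )


if __name__ == "__main__":
    # Hypothetical destination; substitute a real page on your site.
    print(redirect_response("https://example.com/"), end="")
```

Because the status is 307 rather than a permanent redirect, a well-behaved client keeps using the original link URL for future requests, which is what lets the detector fire again on the next visit.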
I'll present the script in a moment.
First, let's get the link right.
The link URL ends with a question mark followed by the
identity of the web page where the link appears. The reason
for the identity is that some browsers, and many spiders,
don't provide referring URLs to scripts. The ?identity
parameter means the script never has to depend on that
unreliable information.
Thus, your link might be an image wrapped in a link to the
script, or it might have linked text instead of an image.
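Both link forms can be sketched with a few small Python helpers that build the markup. Everything named here is an assumption for illustration: the script location /cgi-bin/spider.cgi, the helper names, and the sample page identities are hypothetical, so substitute your own values.

```python
from html import escape
from urllib.parse import quote

# Hypothetical location of the detector script on your server.
SCRIPT_URL = "/cgi-bin/spider.cgi"


def detector_href(page_identity):
    """Return the script URL followed by "?" and the page identity."""
    # Percent-encode the identity so it survives as a query string.
    return f"{SCRIPT_URL}?{quote(page_identity, safe='')}"


def image_link(page_identity, image_src):
    """Build an image link, for example around a horizontal bar image."""
    href = escape(detector_href(page_identity), quote=True)
    src = escape(image_src, quote=True)
    return f'<a href="{href}"><img src="{src}" alt=""></a>'


def text_link(page_identity, link_text):
    """Build a link with linked text instead of an image."""
    href = escape(detector_href(page_identity), quote=True)
    return f'<a href="{href}">{escape(link_text)}</a>'


if __name__ == "__main__":
    print(image_link("index.html", "/images/bar.gif"))
    print(text_link("index.html", "©"))
```

The same markup can simply be typed by hand into a page; the helpers only show where the ?identity parameter goes.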
PHP or SSI could be used to automatically provide the
?identity parameter. Both PHP and SSI construct the page
before it's sent to browsers (or spiders). But JavaScript
should not be used. Spiders might or might not run the
JavaScript to provide the correct parameter.
So, hard code the ?identity parameter into the link unless
you know how to do it automatically with PHP or SSI.
The link should be in a place where spiders will likely
follow it. Preferably, humans won't click on it.
The link might be a horizontal bar image or a copyright
symbol or something else unlikely to be clicked on.
Here is the script:
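What follows is not the original listing but a minimal sketch, in Python, of the behavior described above: email the particulars, then send the 307 redirect. The addresses, the redirect URL, the function names, and the mail server on localhost are all assumptions made for illustration.

```python
#!/usr/bin/env python3
import os
import smtplib
import sys
from email.message import EmailMessage

# All three values are placeholders, not values from this article.
NOTIFY_ADDRESS = "you@example.com"
SENDER_ADDRESS = "spider-detector@example.com"
REDIRECT_URL = "https://example.com/"


def build_report(environ):
    """Build the notification email from CGI environment variables."""
    # Everything after the "?" in the link arrives as QUERY_STRING.
    page = environ.get("QUERY_STRING", "") or "(no identity given)"
    user_agent = environ.get("HTTP_USER_AGENT", "") or "(none sent)"
    ip_address = environ.get("REMOTE_ADDR", "") or "(unknown)"

    message = EmailMessage()
    message["Subject"] = "Spider detector: link followed"
    message["From"] = SENDER_ADDRESS
    message["To"] = NOTIFY_ADDRESS
    message.set_content(
        f"Page: {page}\n"
        f"UserAgent: {user_agent}\n"
        f"IP Address: {ip_address}\n"
    )
    return message


def main():
    report = build_report(os.environ)
    try:
        # Assumes a mail server is listening on localhost, port 25.
        with smtplib.SMTP("localhost") as server:
            server.send_message(report)
    except OSError:
        # Still redirect the visitor if the email cannot be sent.
        pass
    sys.stdout.write(
        "Status: 307 Temporary Redirect\r\n"
        f"Location: {REDIRECT_URL}\r\n"
        "\r\n"
    )


# Web servers set GATEWAY_INTERFACE when running a script as CGI.
if "GATEWAY_INTERFACE" in os.environ:
    main()
```

Keeping the email construction in build_report, separate from the sending, makes it possible to check the "Page", "UserAgent", and "IP Address" lines without a mail server.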
Edit the script with a plain text editor, not a word
processor. Use an FTP program to upload the script. Upload
it as a plain text (ASCII) file, not as a binary file.
Then use the script's URL as the link on your web pages.
The "UserAgent" information in the email is a duplicate of
what the spider (or browser) sent to the script. Google's
web page spider has "Googlebot" somewhere in there.
Yahoo!'s web page spider has "Yahoo! Slurp". Other spiders,
and browsers, have their own identification.
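If you would rather not read every "UserAgent" line by eye, the two substrings mentioned above can be checked in code. This is a minimal Python sketch; the function name identify_spider and the labels are assumptions, and since any visitor can send any UserAgent text, a match is a hint rather than proof.

```python
# Substrings named in this article, mapped to a readable label.
KNOWN_SPIDERS = {
    "Googlebot": "Google",
    "Yahoo! Slurp": "Yahoo!",
}


def identify_spider(user_agent):
    """Return a label for a recognized spider, or None otherwise."""
    for marker, label in KNOWN_SPIDERS.items():
        if marker in user_agent:
            return label
    return None


if __name__ == "__main__":
    sample = "Mozilla/5.0 (compatible; Googlebot/2.1)"
    print(identify_spider(sample))
```

Other spiders can be added to the dictionary as their identification strings show up in your emails.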
The "Page" information is a duplicate of the ?identity
parameter in the link.
And the "IP Address" information is, of course, the IP
address of the visiting spider or browser.
There you have it, a spider detector :)
Will Bontrager
©2005 Bontrager Connection, LLC