Who told you that?

This guy popped up in my referer logs the other day, and it turns out that he actually links back to an old post of mine. I’ve changed my opinion a bit since then, after I realized that it’s fairly trivial to write a script that validates referers. All you need to do is grab the page listed as a referer and check to see if it really contains a link back to your site. It’s only a first-level technique (there are ways around it), but it would certainly catch what Joel is doing.
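As a rough illustration of that check, here is a minimal Perl sketch that works on HTML which has already been downloaded. The function name links_to and the sample markup are made up for this example, and the regex-based href extraction is an approximation, not a real HTML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $html contains a quoted href
# attribute whose value starts with $target, 0 otherwise.
sub links_to {
    my ($html, $target) = @_;
    return 0 unless defined $html && defined $target && length $target;

    # Pull out quoted href values; crude, but enough for a first pass.
    while ($html =~ /href\s*=\s*(["'])(.*?)\1/gis) {
        my $href = $2;
        # \Q...\E keeps dots and question marks in the URL literal.
        return 1 if $href =~ /^\Q$target\E/;
    }
    return 0;
}

my $real = '<p>See <a href="http://example.com/archives/1.html">this</a>.</p>';
my $fake = '<p>No links here, just the text example.com in passing.</p>';

print links_to($real, "http://example.com/") ? "linked\n" : "not linked\n";
print links_to($fake, "http://example.com/") ? "linked\n" : "not linked\n";
```

Matching only href values is a little stricter than searching the whole page for the URL, so a page that merely mentions your address in its text would not count as a link.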

Thus, while Joel says there’s nothing that can be done about his technique… he’s wrong. Admittedly, I haven’t integrated my script with my general-purpose log analysis scripts, but in the cases where I have noticed referer spam I just update my config file and tell the scripts to ignore those referers.
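One way such an ignore list could work, sketched in Perl. This is an assumption about the general shape, not the actual config format: the pattern list, the is_ignored name, and the sample hosts are all placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder patterns; a real list would be read from a config file.
my @ignore = (
    qr{^https?://spam\.example\.net/}i,
    qr{^https?://[^/]*\.example\.org/}i,
);

# Hypothetical helper: returns 1 if the referer matches any pattern.
sub is_ignored {
    my ($referer, @patterns) = @_;
    for my $re (@patterns) {
        return 1 if $referer =~ $re;
    }
    return 0;
}

for my $ref ("http://spam.example.net/page", "http://friendly.example.com/post") {
    print "$ref: ", (is_ignored($ref, @ignore) ? "ignored" : "kept"), "\n";
}
```

Anchoring each pattern at the start of the URL keeps a spammer's host name from matching when it only appears somewhere in a legitimate referer's query string.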

I stuck my script in after the cut. It runs over an active log file, tosses out referers it’s seen before, validates new referers as per the technique above, and emails me a note when it sees a valid new referer. It will not work out of the box on your server, but it should be kind of clear what needs to be updated if you’re a Perl coder. Vanity, vanity, all is vanity.

#!/usr/bin/perl -w
use strict;

####
#### Modules
####
use LWP::Simple;
use IO::File::Log;
use Carp;
use Fcntl;
use NDBM_File;
use Mail::Mailer;

####
#### Initialize variables
####
my $logfile = "LOGFILE GOES HERE";
my $baseURL = "BASE URL FOR YOUR SITE GOES HERE";
my $datafile = "DATA FILE FOR KEEPING TRACK OF REFERERS GOES HERE";
my $email = "YOUR EMAIL GOES HERE";
my %urlhist;
my $fh = IO::File::Log->new($logfile);

####
#### Main loop -- over and over and over...
####
while (my $line = $fh->getline) {
    my @line = split(" ", $line);
    next unless defined $line[10];              # Skip malformed lines

    my $page = $baseURL . $line[6];
    my $referer = substr($line[10], 1, -1);     # Strip surrounding quotes
    (my $date = $line[3]) =~ s/^\[(.*?):(.*)/$1 $2/;

    # A little code to fix /index.php vs / confusion
    $page =~ s/index\.(php|html)$//;
    my $tag = $referer . " " . $page;

    next if (
        ($referer eq "-") ||                    # No referer
        ($page =~ /\.cgi/) ||                   # Don't do CGIs
        ($referer =~ /^http:\/\/popone/) ||     # Don't do self-refs
        ($referer =~ /search|query/)            # Search results
    );

    # NDBM_File doesn't write data until untie(), so we tie and
    # untie as we go
    tie(%urlhist, 'NDBM_File', $datafile, O_RDWR|O_CREAT, 0644) ||
        die "couldn't tie $datafile: $!";
    my $seen = $urlhist{$tag};                  # We've seen it before
    $urlhist{$tag} = 1;                         # Mark it as seen
    untie(%urlhist);
    next if $seen;

    if (scan($referer, $page)) {
        notify($referer, $page, $email, $date);
    }
}

####
#### scan: check for URL in a web page
####   takes: URL to scan, URL to look for
####   returns: true if the URL to look for is found
####
sub scan {
    if ($#_ != 1) {
        carp "scan() takes two arguments";
        return 0;
    }

    my $huntURL = shift;
    my $findURL = shift;

    my $page = get($huntURL);
    if (! $page) {
        carp $huntURL, " couldn't be downloaded";
        return 0;
    }

    # \Q...\E so that ".", "?" and friends in the URL match literally
    return ($page =~ /\Q$findURL\E/);
}

####
#### notify: let us know we found a match
####    takes: URL scanned, URL found, email address, date
####    returns: nothing at all
####
sub notify {
    if ($#_ != 3) {
        carp "notify() takes four arguments";
        return 0;
    }

    my $huntURL = shift;
    my $findURL = shift;
    my $email = shift;
    my $date = shift;

    if (! open(MAIL, "|/usr/sbin/sendmail -t")) {
        carp "couldn't run sendmail: $!";
        return 0;
    }
    # Headers must be contiguous, then a blank line, then the body
    print MAIL <<EOM;
To: $email
From: www\@YOURSITE
Subject: Link Found: $huntURL

${date}:
Referer $huntURL contains a link to $findURL.
EOM
    close(MAIL);
}

Comments

joel April 17, 2003
That is true about validating referrals. However, if you run a busy site, this is going to be a pretty hardcore experience on your host.
I actually wrote that piece about "vanity" while in a fit about a few people I know who spend the majority of their time working on "vanity" blogs, and it sort of disgusts me that smart people would spend that much time writing self-absorbed posts. I am over it.
In testing that script, I managed to generate (in one month) 27,000 links in Google and near-top-level results. I did this simply by taking the DMOZ data and running a multi-threaded script which actually just called curl with the -A, -e, -L, and -I options. No proxies, nothing. It took a little less than a week with negligible network traffic (-I does header only).
Now I realize this is just "too easy," and as more people get involved with doing this, it will be just like spam, perhaps even worse as these guys get chased more and more into corners by laws and anti-spam sites.
So what do we do? I am not really sure. I know not to trust any logs I see, and I am seeing increasing activity in this, which is part of the reason I am not releasing the source code.
Google is my real interest in all of this, and I have several sites set up to play with Google rankings to see just how far I can go. It is my hobby.
Anyway, on a side note, I found your post in my referrer logs. :)
Later
-Joel De Gan