John Flinchbaugh Blog: Details for Referrer Verification

My very recent post about verifying referrers to help limit spam got Doug a bit concerned about the bandwidth. He wouldn't want my site poking his every time someone clicks a link from his site to mine. Additionally, he foresees trouble with these referrer-checking routines getting into loops poking each other.

First, my code already recognizes first hits from a referrer, so it can add it to the list in my servlet context. Any subsequent hit from a referrer just increments the counter. Similarly, the verification code would run once for each referrer before adding it to the list, and not again. After it's been added to the list, I should be able to trust the referrer and just increment its count for the rest of the day, so that eliminates any referrer check floods.
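The once-per-referrer behavior described above could look roughly like the following. This is only a sketch, not my actual servlet code: the class name `ReferrerRegistry`, its methods, and the pluggable `Predicate<String>` verifier are all made up for illustration. Remembering rejected referrers for the day is also an assumption I'm adding, so a spammer can't trigger a fresh check on every hit.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Hypothetical sketch: verify each referrer once, then only count later hits. */
class ReferrerRegistry {
    // Trusted referrers and their hit counts for the current day.
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();
    // Assumption: referrers that failed verification are remembered too.
    private final Set<String> rejected = ConcurrentHashMap.newKeySet();
    // The actual check (e.g. fetching the page) is supplied by the caller.
    private final Predicate<String> verifier;

    ReferrerRegistry(Predicate<String> verifier) {
        this.verifier = verifier;
    }

    /** Returns true if the hit was counted, false if the referrer was rejected. */
    boolean recordHit(String referrer) {
        AtomicInteger known = counts.get(referrer);
        if (known != null) {
            // Already trusted today: just bump the counter, no network traffic.
            known.incrementAndGet();
            return true;
        }
        if (rejected.contains(referrer)) {
            return false;
        }
        // First hit from this referrer: run the verification.
        // Two simultaneous first hits could both verify; fine for a sketch.
        if (!verifier.test(referrer)) {
            rejected.add(referrer);
            return false;
        }
        counts.computeIfAbsent(referrer, k -> new AtomicInteger()).incrementAndGet();
        return true;
    }

    int count(String referrer) {
        AtomicInteger c = counts.get(referrer);
        return c == null ? 0 : c.get();
    }

    /** Clears the day's lists so each referrer gets re-verified tomorrow. */
    void resetForNewDay() {
        counts.clear();
        rejected.clear();
    }
}
```

With something like this, a referrer that sends a hundred hits in a day still costs only one verification request.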

Loops won't happen either, because my weblog will be poking the referrer site directly from my server. I'll request the prospective referrer's page without a "Referer:" header, so the other site will have nothing to verify. That means no loop, and my request will register there as just a single "direct" hit. As noted previously, there'd only be one such hit per referrer per day. (Additionally, to avoid slowing the load time for my site, all this referrer-checking code will move into an MDB to run asynchronously.)
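Here's a rough sketch of what that check might look like. It assumes Java 11's `java.net.http.HttpClient`, which sends no "Referer:" header unless one is set explicitly. The class name `ReferrerVerifier`, the timeouts, and the simple case-insensitive substring match are all my own placeholders for illustration; a real check would probably want to parse the links properly.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Locale;

/** Hypothetical sketch: fetch a referrer page and look for a link back to my site. */
class ReferrerVerifier {
    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))   // assumed timeout
            .build();
    private final String myUrl;

    ReferrerVerifier(String myUrl) {
        this.myUrl = myUrl;
    }

    /** Crude check: does the page body mention my URL anywhere? */
    static boolean pageLinksTo(String body, String myUrl) {
        return body != null
                && body.toLowerCase(Locale.ROOT).contains(myUrl.toLowerCase(Locale.ROOT));
    }

    /** Fetches the referrer page with no Referer header set; false on any failure. */
    boolean verify(String referrerUrl) {
        try {
            // No .header("Referer", ...) call here, so the other site sees a
            // plain direct hit and has nothing of its own to verify.
            HttpRequest request = HttpRequest.newBuilder(URI.create(referrerUrl))
                    .timeout(Duration.ofSeconds(10))  // assumed timeout
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            return response.statusCode() == 200 && pageLinksTo(response.body(), myUrl);
        } catch (IOException | IllegalArgumentException e) {
            // Unreachable host, malformed URL, unsupported scheme: treat as unverified.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Since the fetch can take a few seconds, it's the sort of thing that belongs off the request thread, which is why the asynchronous approach makes sense here.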

It would function similarly to Snert's milter-sender for sendmail. Milter-sender tries to determine if the sender's account will accept DSN email before it accepts the incoming email. I admit that I thought this was a bit heavy-handed and wasteful of bandwidth when I first read about it, since I probably get 3000+ emails a day through mailing lists, etc. For my relatively small number of weblog hits, though, I like the idea for handling these referrer spammers.

On any given day, I have up to 30 unique referrers, mostly from Google searches, so I would have sent out only 30 verification requests in that whole day.

For email, I prefer pure Bayesian filters (bogofilter, specifically), since they are easily trainable, effective, and very low-noise for the rest of the world. I don't fill mail queues trying to bounce spam destined for invalid addresses; I just drop it. I don't care to code a Bayesian filter for my weblog, and I don't think there's enough content in a URL to filter it effectively this way. If I try to get more content by hitting the link and analyzing the page with a Bayesian algorithm, I may as well have just checked for my URL in the first place and been done.