When thinking about click fraud and related issues, it is essential to consider a number of areas. First, there are the physical attacks that come from humans, such as “Paid to Read” and “Pay to Click” sources. Looking at correlations of time and location, I would correlate browsing patterns to find sources that are likely to be engaging in unacceptable activities.
If the cost of directly paying or coercing a human to click on a banner is less than the return from the click, fraud will occur. Referrer sites can also be telling: a referrer could be a pornographic site, for instance, as many such sites offer access to their content in return for visitors solving CAPTCHAs.
Impression spam comes about as a consequence of HTTP requests for web pages that contain advertisements but that do not necessarily correspond to a user viewing the page. A web crawler or “scraping” program could be used to issue an HTTP request for a web page that happens to contain a banner. Statistically, this variety of request can be distinguished from requests issued by human users, as the banner requests would not correlate with the other images and calls made by the page.
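A minimal sketch of this check follows. The asset list, log format and the 50% threshold are all illustrative assumptions, not part of any real ad platform: the idea is simply that a session which fetched the banner but few of the page's other resources probably never rendered the page for a human.

```python
# Illustrative sketch: flag sessions that requested the ad banner without
# the page's other assets. Paths and threshold are assumed for the example.

PAGE_ASSETS = {"/style.css", "/logo.png", "/script.js"}  # assets a real render fetches
BANNER = "/banner.gif"

def is_impression_spam(requested_paths):
    """True when the banner was fetched but fewer than half of the
    page's other assets were -- a scraper-like request pattern."""
    if BANNER not in requested_paths:
        return False
    fetched = len(PAGE_ASSETS & set(requested_paths))
    return fetched / len(PAGE_ASSETS) < 0.5

# A scraper typically issues one request per URL:
print(is_impression_spam(["/index.html", "/banner.gif"]))             # True
# A real browser pulls in the page's other resources too:
print(is_impression_spam(["/index.html", "/banner.gif",
                          "/style.css", "/logo.png", "/script.js"]))  # False
```

In practice the asset set would be derived per page from the server's own logs rather than hard-coded.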
Next there are invalid clicks – these occur due to malicious intent, from either “advertiser competitor clicking” or “publisher click inflation”. I see advertiser competitor clicking as the type most likely to be a problem in this instance. For this, it would be necessary to analyse click sources against competitor keywords.
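One way to sketch such an analysis: flag sources whose clicks concentrate overwhelmingly on a single advertiser's ads, the pattern expected of a competitor draining one rival's budget. The 80% concentration and 10-click minimum below are illustrative assumptions, not established thresholds.

```python
from collections import Counter

def competitor_click_suspects(clicks, concentration=0.8, min_clicks=10):
    """clicks: iterable of (source_ip, advertiser) pairs.
    Returns source IPs whose clicks pile onto one advertiser."""
    by_source = {}
    for ip, advertiser in clicks:
        by_source.setdefault(ip, Counter())[advertiser] += 1
    suspects = []
    for ip, counts in by_source.items():
        total = sum(counts.values())
        top = counts.most_common(1)[0][1]  # clicks on the most-clicked advertiser
        if total >= min_clicks and top / total >= concentration:
            suspects.append(ip)
    return suspects

# Twelve clicks from one IP, all on a single rival's ads, versus a
# source spreading a handful of clicks across several advertisers:
clicks = [("10.0.0.1", "rival-corp")] * 12 + \
         [("192.0.2.7", a) for a in ("acme", "rival-corp", "initech")]
print(competitor_click_suspects(clicks))  # ['10.0.0.1']
```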
Next there is the issue of Tor networks and open proxies, as well as systems that strip cookies, modify identifying information and alter requests. Infected cyber-cafés are also an issue. However, in many cases the source address location can be compared against the client's target market to determine fraud, and over time these sources are likely to succumb to analysis.
Next there are the particular robotic attacks that could be deployed. I would check for signs of “bots”, such as invalid user-agent strings or unlikely fields in the headers of HTTP requests. The statistical distribution of user-agent strings should correlate with the known distribution of browsers in use.
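That distribution check can be sketched as a chi-square test against an expected browser mix. The market shares below are made-up illustrative figures, and 7.815 is the standard 5% chi-square critical value for three degrees of freedom.

```python
# Illustrative sketch: is the observed user-agent mix plausible for human
# traffic? Expected shares are assumed figures, not real market data.

EXPECTED_SHARE = {"Chrome": 0.65, "Firefox": 0.15, "Safari": 0.15, "Other": 0.05}

def user_agent_anomaly(observed_counts, critical=7.815):
    """True when the chi-square statistic against the expected browser
    distribution exceeds the 5% critical value (3 degrees of freedom)."""
    n = sum(observed_counts.values())
    chi2 = 0.0
    for agent, share in EXPECTED_SHARE.items():
        expected = share * n
        seen = observed_counts.get(agent, 0)
        chi2 += (seen - expected) ** 2 / expected
    return chi2 > critical

# 1000 clicks all claiming Firefox -- far off any plausible mix:
print(user_agent_anomaly({"Firefox": 1000}))  # True
# 1000 clicks roughly matching the expected mix:
print(user_agent_anomaly({"Chrome": 640, "Firefox": 160,
                          "Safari": 150, "Other": 50}))  # False
```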
Clickbot networks have a level of predictability, as do for-sale/for-rent botnets.
“Forced browser clicks” are more difficult and are more likely to require offline detection. The aggregate set of clicks should correlate with the distribution of files on the web server.
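As an offline sketch of that correlation, per-page click counts can be compared against per-page view counts: a low or negative Pearson correlation suggests clicks piling up where views are not, the signature of forced or scripted clicks. The data below is synthetic.

```python
# Illustrative sketch: do per-page clicks track per-page views?
# All figures are synthetic example data.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

views  = [9000, 4000, 2000, 500]   # page views per page
normal = [90, 42, 19, 4]           # clicks roughly proportional to views
forced = [10, 8, 5, 400]           # clicks piled onto the least-viewed page

print(round(pearson(views, normal), 2))  # close to 1.0
print(round(pearson(views, forced), 2))  # strongly negative
```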
Next there are a number of other areas to consider:
- Covert_TCP and other covert channel methods,
- DNS rebinding attacks
- Distributed malware
- XSS, Flash with embedded HTTP calls, etc.
Some other considerations would have to include p0wf (passive fingerprinting of web content frameworks) and time-based correlations.
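One simple time-based correlation can be sketched via inter-click intervals: a script clicking on a timer produces suspiciously regular gaps, while human clicks vary widely. The 0.1 coefficient-of-variation threshold is an illustrative assumption.

```python
# Illustrative sketch: timer-driven sources show near-constant gaps
# between clicks. Threshold is assumed for the example.

def is_timer_like(timestamps, cv_threshold=0.1):
    """True when the coefficient of variation of inter-click intervals
    is low enough to suggest an automated, timer-driven source."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return (var ** 0.5) / mean < cv_threshold

print(is_timer_like([0, 30, 60, 90, 120, 150]))  # True: a click every 30 s
print(is_timer_like([0, 4, 95, 103, 420, 431]))  # False: irregular, human-like
```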