Combatting E-mail Fraud: The Phishing Net

"Phishing" is the term coined by hackers for attempting to lure personal information out of people by persuading them to visit web sites that look like genuine bank, credit card, or payment sites but are actually sophisticated fakes of those sites.

What follows tries to describe exactly how the phishing net works. The process is fairly complicated, so this description cannot be perfect.

Many of the items listed below handle "obfuscations" of text and URLs, that is, attempts to disguise the real text. These include swapping letters around, using letters that look very much like other letters, using ";" instead of ":", using "," instead of ".", and many similar tricks. I have tried to highlight which rules handle obfuscations, but I have not given the details of exactly what each rule will accept. Many variations on the expected text will be detected.
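To make the idea of an obfuscation concrete, here is a minimal Python sketch that undoes two of the tricks named above: ";" used instead of ":" in the scheme, and "," used instead of ".". The function name and both patterns are assumptions made for illustration. The real rules accept many more variations, and their details are deliberately not given here.

```python
import re

# Hypothetical pattern: http, https or ftp, then ':' or its ';' disguise,
# then one to three slashes or backslashes.
SCHEME_RE = re.compile(r"^(?:https?|ftp)[:;][/\\]{1,3}", re.IGNORECASE)


def undo_simple_obfuscations(text: str) -> str:
    """Strip a (possibly obfuscated) scheme and turn ',' back into '.'."""
    text = SCHEME_RE.sub("", text.strip())
    # A comma sitting between two letters or digits is treated as a disguised dot.
    return re.sub(r"(?<=[A-Za-z0-9]),(?=[A-Za-z0-9])", ".", text)


print(undo_simple_obfuscations("http;//www,example,com"))
```

Letter transpositions and look-alike characters would need further rules that this sketch does not attempt.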

    Keep track of all <BASE> tags as they provide a root URL for every relative link on the page.
    Attach the <BASE> URL onto the front of all relative URLs contained in every link on the page.
    Look for links contained in imagemaps. The imagemap may be inside a link to a safe site, and contain an image of the text of the name of the safe site. But it can have a rectangle defined in it whose link destination is a fraud site. Counter this by removing imagemaps, so that the real destination of the link is used instead of the apparent destination.
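The first two steps above can be sketched in Python with the standard library. This is only an illustration: the class name LinkCollector is hypothetical, imagemaps are not handled, and urljoin performs standard relative-URL resolution, which is a stand-in for simply attaching the BASE URL to the front of the link as described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Remember the most recent <BASE> URL and resolve each <A HREF> against it."""

    def __init__(self):
        super().__init__()
        self.base = None   # most recent <base href="..."> seen so far
        self.links = []    # real destinations, with relative links resolved

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return
        if tag == "base":
            self.base = href
        elif tag == "a":
            # Absolute links pass through urljoin unchanged.
            self.links.append(urljoin(self.base, href) if self.base else href)


collector = LinkCollector()
collector.feed(
    '<base href="http://fraud.example/x/">'
    '<a href="login.php">www.bank.example</a>'
    '<a href="http://bank.example/">bank</a>'
)
print(collector.links)
```

The first link shows why the BASE tag matters: a harmless-looking relative link ends up pointing at whatever site the BASE tag names.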
     
     
    Destination   Operation
    -----------   ---------
    apparent      Convert to lower case.
    apparent      Remove %a0 encoded characters (hard space).
    apparent      Decode all %-encoded characters.
    apparent      Remove all white space.
    apparent      Change any \ to / as many browsers do this quietly to help Windows authors.
    apparent      Remove all HTML tags.
    apparent      Remove the username part of email addresses.
    apparent      Remove all &-encoded symbols such as &lt; and &gt;.
    real          Insert the BASE URL if the link is relative and the BASE URL is defined.
    real          Convert to lower case.
    real          Remove %a0 encoded characters (hard space).
    real          Decode all %-encoded characters.
    real          Force "safe" result if it does not contain either a . or a /.
    real          Remove all white space.
    real          Force "safe" result if it is an email address.
    real          Remove all HTML tags.
    real          Remove "blocked::" labels as inserted by some other products.
    real          Remove any leading http:// or ftp:// or slight variations on those, including replacing the : with a ;.
    real          Force "safe" result if it is a mailto: link.
    real          Remove everything after the first / or ?.
    real          Remove any trailing pr, p or ul tags.
    real          Force "safe" result if it is a file: link.
    real          Force "safe" result if it is a link to somewhere else in the same page (internal link).
    real          Remove any trailing /.
    real          Force "dangerous" result if the URL contains any non-printable-ASCII characters.
    apparent      Continue searching if any of these are true:
                  1. it starts with the letters usually used at the start of a website name, e.g. www, ftp and any mis-spellings or transpositions of these,
                  2. it ends with .com, .org, .net, .info, .biz, .ws or other strings which look like these,
                  3. it ends with .com or .co followed by a 2-letter country code,
                  4. it starts with http: or ftp: or mailto: or any mis-spellings or obfuscated versions of these,
                  5. you are looking for numeric IP addresses (Phishing By Numbers) and the link contains none of <, > or the letters g-z.
    apparent      Remove leading strings that look like http:, ftp:, mailto: and other obfuscations of these.
    apparent      Remove everything after the first /.
    apparent      Remove all trailing . characters (and obfuscations).
    apparent      Add www. on the front unless it already starts with www, ftp, mailto or obfuscations of these.
    real          Force a "dangerous" result if Phishing By Numbers is in use and the link is numeric (IPv4 and IPv6).
    both          Compare the apparent destination with the real destination, with an optional www on the front.
     
     
    If they do not match, and the real address is not in the Phishing Safe Sites file, trigger a "dangerous" result.
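
The final comparison can be sketched in Python as follows. This is a heavily simplified illustration, not the actual rules: it applies one merged subset of the operations to both destinations, skips the "continue searching" checks and the forced "safe" and "dangerous" results, and handles only the ";" for ":" obfuscation. The names host_of, is_dangerous and safe_sites are hypothetical, and safe_sites stands in for the contents of the Phishing Safe Sites file.

```python
import re
from urllib.parse import unquote

# Assumed pattern: a leading http:, https: or ftp:, allowing ';' in place of ':'.
SCHEME_RE = re.compile(r"^(?:https?|ftp)[:;]/*")


def host_of(text: str) -> str:
    """Reduce link text or a URL to a bare host name (subset of the listed steps)."""
    text = text.lower()                            # convert to lower case
    text = text.replace("%a0", "")                 # remove encoded hard spaces
    text = unquote(text)                           # decode %-encoded characters
    text = re.sub(r"\s+", "", text)                # remove all white space
    text = text.replace("\\", "/")                 # change \ to /
    text = re.sub(r"<[^>]*>", "", text)            # remove HTML tags
    text = SCHEME_RE.sub("", text)                 # remove a leading scheme
    text = re.split(r"[/?]", text, maxsplit=1)[0]  # drop everything after the first / or ?
    return text.rstrip(".")                        # remove trailing dots


def strip_www(host: str) -> str:
    """Make the leading www. optional for the comparison."""
    return host[4:] if host.startswith("www.") else host


def is_dangerous(apparent: str, real: str, safe_sites=frozenset()) -> bool:
    """True when the destinations differ and the real one is not a known safe site."""
    real_host = host_of(real)
    if real_host in safe_sites:
        return False
    return strip_www(host_of(apparent)) != strip_www(real_host)


print(is_dangerous("www.bank.example", "http://fraud.example/login"))
print(is_dangerous("http://www.bank.example/", "HTTP://bank.example/account?x=1"))
```

The first call reports a mismatch because the visible text names one host while the link points at another. The second does not, because both reduce to the same host once the scheme, the path and the optional www are removed.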