LOOKUP MAPS (hash, SQL) AND ACCESS LISTS EXPLAINED
==================================================
   Updated: 2002-04, 2002-06, 2002-11, 2002-12,
            2003-03, 2003-05, 2003-06, 2003-09, 2003-12,
            2004-01, 2004-03, 2004-12,
            2005-01, 2005-03, 2005-05, 2005-08,
            2010-10
   Mark Martinec <Mark.Martinec@ijs.si>

(applies to the semantics of amavisd.conf variables such as:
   %virus_lovers,     %bypass_checks,
   @virus_lovers_acl, @bypass_checks_acl,
   $virus_lovers_re,  $bypass_checks_re,
   %local_domains, @local_domains_acl, %mailto, ... )

NOTE:
  All lookups are performed with raw (rfc2821-unquoted
  and unbracketed) addresses as a key,
      i.e.:    Bob "Funny" Dude@example.com
      not:     "Bob \"Funny\" Dude"@example.com
      and not: <"Bob \"Funny\" Dude"@example.com>


Several configurable settings in amavisd are controlled through the use
of table lookups (hash/associative array), access control lists (array),
Perl-regexp -based access control lists, SQL or LDAP lookups.
The subroutine that does all the lookups is:

  sub lookup($$@) {
    my($get_all, $addr, @tables) = @_;

It perform a lookup for a key (usually a recipient e-mail envelope address,
unless otherwise noted) against one or more lookup tables / maps.
The set of maps used to be hard-wired into the program (but no longer is),
and the order chosen is: from specific to more general, and from faster
to slower, which is usually flexible enough. Thus the default sequence
of lookups: SQL, LDAP, hash, ACL, regexp, constant. The first that returns
a definitive answer (not undef/NULL) stops the search.

The SQL and LDAP are somewhat specific and are always consulted first.
There can only be one (or none) SQL and one (or none) LDAP lookup.
This is an implementational limitation, and might be lifted some day.

The lists of static lookup tables are configurable since 20040701
(amavisd-new-2.0), and is controlled by array variables such as:

  @virus_admin_maps = (\%virus_admin, \$virus_admin);
  @viruses_that_fake_sender_maps = (\$viruses_that_fake_sender_re, 1);
  @spam_kill_level_maps = (\$sa_kill_level_deflt);
  @local_domains_maps =
    (\%local_domains, \@local_domains_acl, \$local_domains_re);
  @bypass_virus_checks_maps =
    (\%bypass_virus_checks, \@bypass_virus_checks_acl, \$bypass_virus_checks_re);
  @virus_lovers_maps =
    (\%virus_lovers, \@virus_lovers_acl, \$virus_lovers_re);

See amavisd.conf-default for a complete list of these @*_maps variables.
The above example shows that the default value of these arrays exactly
corresponds to the formerly hard-wired search order. Users are free to
leave these @*_maps variables at their default, referencing the legacy
variables, or the list can be replaced entirely. There may be any number
of lookup tables of any static type specified in these lists. Some restrain
is warranted nevertheless for efficiency reasons - one lookup into larger
lookup table is ofter quicker than two lookups into smaller ones.

Some lookup maps can only return boolean result (e.g. ACL), other maps
may return any value, which can be interpreted as boolean, numeric, string
or possibly other. The result of some lookup maps (e.g. regexp) may include
pieces of lookup key.

If a match is found, the subroutine lookup() returns whatever the map
returns;  undef is returned if nothing matches (which for Perl is false
as well).


A CONSTANT

Specifying a Perl scalar as an argument to lookup() is a degenerate
case of a lookup table: it matches any key, and the value of the
scalar is returned as the match value.

Specifying a scalar argument in a call to lookup() (e.g. as the last element
in @*_maps arrays) is useful as a last-resort (catchall, default) value.

One level of indirection is alowed, so the following three cases are
equivalent:
  $sa_kill_level_deflt = 6.0;
  @spam_kill_level_maps = (\%some_hash, \$sa_kill_level_deflt);

and:
  $sa_kill_level_deflt = 6.0;
  @spam_kill_level_maps = (\%some_hash, $sa_kill_level_deflt);

and:
  @spam_kill_level_maps = (\%some_hash, 6.0);

The first case allows for the value of a scalar variable to be assigned
even _after_ the assignment to @*_maps, so this still works as expected:
  @spam_kill_level_maps = (\%some_hash, \$sa_kill_level_deflt);
  $sa_kill_level_deflt = 6.0;

but the following does not (it uses a value in the scalar variable
at the time of assignment to the list, which is most likely not 6.0):
  @spam_kill_level_maps = (\%some_hash, $sa_kill_level_deflt);
  $sa_kill_level_deflt = 6.0;


HASH LOOKUPS (associative array lookups)

For arguments to subroutine lookup() of type hash-ref, the argument
is passed to subroutine lookup_hash(), which does a lookup into
a Perl hash.

Hash lookups (e.g. for User+foo@Sub.Example.com) are performed in the
following order:
 - lookup for user+foo@sub.example.com
 - lookup for user@sub.example.com (only if $recipient_delimiter is '+')
 - lookup for user+foo@
 - lookup for user@  (only if $recipient_delimiter is '+')
 - lookup for sub.example.com
 - lookup for .sub.example.com
 - lookup for .example.com
 - lookup for .com
 - lookup for .

The search sequence stops as soon as a match is found, and the value
of the matched entry determines the result.

The domain part of a key is always lowercased, the localpart is lowercase
when $localpart_is_case_sensitive is true (not case-sensitive by default).
The keys in a hash should match the case of a key, i.e. should typically
be entirely in lowercase.

A field value undef implies that the next lookup table (if there are more)
is to be tried. In plain words, undef means "this table does not know
the answer, try the next one". Further searching in this table
(for possibly more general defaults) is terminated.

NOTE: a null reverse path e-mail address used by MTA for delivery status
notifications (DSN) has empty local part and empty domain. As far as the
lookup is concerned (which uses raw, i.e. non-quoted and non-bracketed
address form), this address is @, i.e. a single character "@".
The lookup_hash for null address goes through the following sequence
of keys: "", "@", "." (double quotes added for clarity, they are not part
of the key).

There is a subroutine read_hash() available for use in amavisd.conf.
It can read keys from a plain text file, and load them into a Perl hash.
Format of the text file: one address per line, anything from '#' to the end
of line is treated as a comment, but '#' within correctly quoted rfc2821
addresses is not treated as a comment (e.g. a hash sign within
"strange # \"foo\" address"@example.com is valid). Leading and trailing
whitespace is discarded, empty lines (containing only whitespace and comment)
are ignored. Addresses are converted from quoted form into internal (raw)
form and inserted as keys into a given hash, with a value of 1 (true).
Each address can have an associated optional value (also known as the
'righthand side' or RHS) separated from the address by whitespace.
An absence of a value implies 1 (true). The $hashref argument is returned
for convenience, so that one can say for example:
  $per_recip_whitelist_sender_lookup_tables = {
    '.my1.example.com' => read_hash({},'/var/amavis/my1-example-com.wl'),
    '.my2.example.com' => read_hash({},'/var/amavis/my2-example-com.wl') }


LIST LOOKUPS (ACL)

For arguments to subroutine lookup() of type array-ref, the argument
is passed to subroutine lookup_acl(), which does an access list lookup:

  sub lookup_acl($$) {
      my($addr, $acl_ref) = @_;

The supplied e-mail address is compared with each member of the
lookup list in turn, the first match wins (terminates the search),
and its value decides whether the result is true (yes, permit, pass)
or false (no, deny, drop). Falling through without a match
produces false (undef). Search is case-insensitive.

NOTE: lookup_acl is not aware of address extensions and they are
not handled specially!

If a list element contains a '@', the full e-mail address is compared,
otherwise if a list element has a leading dot, the domain name part is
matched only, and the domain as well as its subdomains can match. If there
is no leading dot, the domain must match exactly (subdomains do not match).

The presence of character '!' prepended to a list element decides
whether the result will be true (without a '!') or false (with '!')
in case this list element matches and terminates the search.

Because search stops at the first match, it only makes sense
to place more specific patterns before the more general ones.

Although not a special case, it is good to remember that '.' always matches,
so a '.' would stop the search and return true, whereas '!.' would stop the
search and return false (0).

Examples:

given: @acl = qw( me.ac.uk !.ac.uk .uk )
  'u@me.ac.uk' matches me.ac.uk, returns true and search stops

given: @acl = qw( me.ac.uk !.ac.uk .uk )
  'u@you.ac.uk' matches .ac.uk, returns false (because of '!'), search stops

given: @acl = qw( me.ac.uk !.ac.uk .uk )
  'u@them.co.uk' matches .uk, returns true and search stops

given: @acl = qw( me.ac.uk !.ac.uk .uk )
  'u@some.com' does not match anything, falls through and returns undef

given: @acl = qw( me.ac.uk !.ac.uk .uk !. )
  'u@some.com' similar to the previous, except it returns 0 instead of undef,
  which would only make a difference if this ACL is not the last argument
  in a call to lookup()

given: @acl = qw( me.ac.uk !.ac.uk .uk . )
  'u@some.com' matches catchall ".", and returns true

more complex example: @acl = qw(
  !The.Boss@dept1.xxx.com .dept1.xxx.com
  .dept2.xxx.com .dept3.xxx.com lab.dept4.xxx.com
  sub.xxx.com !.sub.xxx.com
  me.d.aaa.com him.d.aaa.com !.d.aaa.com .aaa.com
);


Comparing hash (associative array) and ACL:

For smaller sets of keys and if only boolean results are needed,
both hash and ACL are appropriate.

- hash is still effective for lots of keys, ACL search is linear;
- hash can return any value, not just true or false;
- hash can strip away address extension, ACL can not;

- ACL appears simpler and more obvious for smaller sets;
- ACL can accommodate arbitrarily nested if-then-elseif-then-...-else cases
  whereas hash only follows a fixed order of stripping addresses;


ACL FOR IP ADDRESSES

A special type of lookup is an IP-matching access list implemented by
lookup_ip_acl(). It performs a lookup for an IP address against a list
or an asssociative array (a hash) of IPv4 or IPv6 networks. It is used
by amavisd for example to check if the SMTP client (normally your MTA)
is allowed to connect, which is why it is sometimes called 'access control
list' or ACL (the variable is @inet_acl).

IP address is compared with each member of an access list in turn,
the first match wins (terminates the search), and its value decides
whether the result is true (yes, permit) or false (no, deny).
Falling through without a match produces false (undef).

The presence of character '!' prepended to a list member decides
whether the result will be true (without a '!') or false (with '!')
in case this list member matches and terminates the search.

Because search stops at the first match, it only makes sense
to place more specific patterns before the more general ones.

Network can be specified in classless notation n.n.n.n/k, or using
a mask n.n.n.n/m.m.m.m . Missing mask implies /32, i.e. a host address.

Although not a special case, it is good to remember that '::/0'
always matches any IPv4 or IPv6 address (even syntactically invalid address).

The '0/0' is equivalent to '::FFFF:0:0/96' and matches any syntactically
valid IPv4 address (including IPv4-mapped IPv6 addresses), but not other
IPv6 addresses!

Example
  given: @acl = qw( !192.168.1.12 172.16.3.3 !172.16.3.0/255.255.255.0
                    10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
                    !0.0.0.0/8 !:: 127.0.0.0/8 ::1 );
  matches rfc1918 private address space except host 192.168.1.12
  and net 172.16.3/24 (but host 172.16.3.3 within 172.16.3/24 still matches).
  In addition, the 'unspecified' (null, i.e. all zeros) IPv4 and IPv6
  addresses return false, and IPv4 and IPv6 loopback addresses match
  and return true.

If the supplied lookup table is a hash reference, match a canonical IP
address: dot-quad IPv4, or preferred IPv6 form, against hash keys. For IPv4
addresses a simple classful subnet specification is allowed in hash keys
by truncating trailing bytes from the looked up IPv4 address. A syntactically
invalid IP address can only match a hash entry with an undef key.

Besides looking up full CIDR-style IPv4 or IPv6 lists, later versions of
lookup_ip_acl() also make possible matching against a hash lookup table,
which only allows for full addresses in canonical form (dotted-quad IPv4
addresses without leading zeroes or IPv6 addresses in canonical preferred
form: x:x:x:x:x:x:x:x), or classful IPv4 subnets with truncated octets,
such as:
  ('10.11.12.13'=>1, '192.168.1.2'=>0, '192.168'=>1, '127'=>1, '10'=>1)

A convenient method of reading CIDR-style IP lists or a hash from a file
if by calling provided routines read_array or read_hash, e.g.:
  @mynetworks_maps = (read_array('/etc/amavisd-mynetworks'), \@mynetworks);
or:
  @mynetworks_maps = (read_hash('/etc/amavisd-mynetworks'), \@mynetworks);


REGULAR EXPRESSION LOOKUPS

For arguments to subroutine lookup() of type Amavis::Lookup::RE
(objects), the object is passed to method lookup_re, which does a
lookup into a list of Perl regular expressions (regexp or RE for short).

The full unmodified e-mail address is always used, so splitting to localpart
and domain or lowercasing is NOT performed. The regexp is powerful enough
that this is unnecessary. The routine is useful for other RE tests, such as
looking for banned file names.

Each element of the list can be a ref to a pair, or directly a regexp
('Regexp' object created by qr operator, or just a (less efficient)
string containing a regular expression). If it is a pair, the first
element is treated as a regexp, and the second provides a return value
in case the regexp matches. If not a pair, the implied result value
of a match is 1.

The regular expression is taken as-is, no implicit anchoring or setting
case insensitivity is done, so do use a qr'(?i)^user@example\.com$',
and not a sloppy qr'user@example.com', which can easily backfire.
Also, if qr is used with a delimiter other than ' (apostrophe), make sure
to quote the @ and $ .

The pattern allows for capturing of parenthesized substrings, which can
then be referenced from the result string using the $1, $2, ... notation,
as with the Perl m// operator. The number after a $ may be a multi-digit
decimal number. To avoid possible ambiguity the ${n} or $(n) form may be used.
Substring numbering starts with 1. Nonexistent references evaluate to empty
strings. If any substitution is done, the result inherits the taintedness
of the key. Keep in mind that $ and @ characters needs to be backslash-quoted
in qq() strings. Example:
  $virus_quarantine_to = new_RE(
    [ qr'^(.*)@example\.com$'i => 'virus-${1}@example.com' ],
    [ qr'^(.*)(@[^@]*)?$'i     => 'virus-${1}${2}' ] );

Example (equivalent to the example in lookup_acl):
   $acl_re = new_re->new(
                 qr'@me\.ac\.uk$'i,
                 [ qr'[@.]ac\.uk$'i => 0 ],
                 qr'\.uk$'i,
             );
   ($r,$k) = $acl_re->lookup_re('user@me.ac.uk');
or $r = lookup('user@me.ac.uk', $acl_re);

'user@me.ac.uk'   matches me.ac.uk, returns true and search stops
'user@you.ac.uk'  matches .ac.uk, returns false (because of =>0) and search stops
'user@them.co.uk' matches .uk, returns true and search stops
'user@some.com'   does not match anything, falls through and returns false (undef)

NOTE: new_RE is a synonym (shorthand) for the
      internal subroutine Amavis::Lookup::RE::new

See Perl documentation (or Google the Internet) for the description
of Perl regular expressions. They are just enhanced version of Posix regular
expressions, i.e. what egrep, awk and sed thrive on. Here are the
most important constructs (simplified):

    .   Match any character                 inter..t
    |   Alternation                         alfa|beta|gamma
    ()  Grouping                            (pre|post)fix
    []  Set of characters (char. class)     [Aa]lfa[0-9]
    ^   Match the beginning of the string   ^MakeMoney
    $   Match the end of the string         com$
    \   Quote the next metacharacter        \.com$
                                            ^\$\$\$\+spam@\[127\.0\.0\.1\]$
    most other characters just match themselves

quantifiers may be placed after the pattern to modify its meaning
from 'match itself exactly once' into:
    *      Match 0 or more times            ^alfa.*omega$
    +      Match 1 or more times            alfa +beta
    ?      Match 1 or 0 times               (first)?aid
    {n}    Match exactly n times            0{6}
    {n,}   Match at least n times           !{3,}
    {n,m}  Match at least n but not more than m times


SQL LOOKUPS

For general SQL considerations and the interpretation of @lookup_sql_dsn
please see documentation in README.sql .

The amavisd.conf variable @lookup_sql_dsn controls access to the SQL
server (dsn = data source name). If the list @lookup_sql_dsn is empty,
no attempts to use SQL for lookups will be made, and no code to use DBI
will be loaded or compiled (if @storage_sql_dsn is empty as well).

For arguments to subroutine lookup() of type Amavis::Lookup::SQLfield
(objects), the object is passed to a method lookup_sql_field, which does
a lookup into an SQL table by using Perl module DBI.

An SQL 'select' requests all available fields from the specified tables,
and the result is cached (just for this mail message processing).
Individual fields can be extracted one at a time from this cache
very quickly, so there is no penalty in using several calls to lookup
for different fields (for the same key) in different parts of the program.

lookup_sql() performs a lookup for an e-mail address against an SQL map.
If a match is found it returns whatever the map returns (a reference
to a hash containing values of requested fields), otherwise returns undef.
A match aborts further fetching sequence.

lookup_sql_field() also performs a lookup for an e-mail address against
an SQL map. It first calls lookup_sql() if it hasn't been called yet for
this key, requesting it to return all matching records. Instead of returning
the whole record as lookup_sql does, it returns just a value of one particular
table field, the first one with a defined (non-NULL) value from the list
of matching records (or undef if there are none).

The lookup_sql_field() is the subroutine that gets called from lookup()
for arguments (objects) of type Amavis::Lookup::SQLfield.

A field value NULL is translated to a Perl undef, and is treated the same
as a nonexistent field of the specified name. For any field (apart from
the only exception field 'local') a record with a NULL/undef/nonexistent
field is skipped and search continues with the next, more general,
matching record until a defined value is found or all matching records are
checked. The only exception is a field named 'local', where a nonexistent
field on an otherwise matching record implies a value true - see section
"Special handling of optional SQL field 'users.local'" further down.

Boolean fields are usually represented as a single character (instead of
an integer) to minimize storage. Characters N,n,F,f,0,NUL and SPACE
represent false (0), any other character represents true. Trailing blanks
are ignored. It is customary to use Y for true and N for false.

SQL lookups (e.g. for user+foo@example.com) are performed in order
which is usually specified by 'ORDER BY...DESC' in the SELECT statement;
otherwise the order is unspecified, which is only useful if just specific
entries exist in a database (e.g. full address always, not domain part only
or mailbox parts only).

The following order (implemented by sorting on the 'priority' field
in DESCending order, zero is low priority) is recommended, to follow
the same specific-to-general principle as in other lookup tables;
the first column is a suggested priority (the exact value does not matter
as long as the order is maintained):

9 - lookup for user+foo@sub.example.com
8 - lookup for user@sub.example.com (only if $recipient_delimiter is '+')
7 - lookup for user+foo (only if domain part is local)
6 - lookup for user     (only local; only if $recipient_delimiter is '+')
5 - lookup for @sub.example.com
3 - lookup for @.sub.example.com
2 - lookup for @.example.com
1 - lookup for @.com
0 - lookup for @.       (catchall)

NOTE:
 this is different from hash and ACL lookups in two important aspects:
   - key without '@' implies a mailbox name, not a domain name;
   - naked mailbox name lookups (without an '@', e.g. 'user') are only
     performed when the mail address matches local_domains lookups.

The domain part is always lowercased when constructing a key,
the localpart is not lowercased when $localpart_is_case_sensitive is true.
With SQL datatypes CHAR and VARCHAR this does not matter anyway, as such
fields are matched case-insensitively.

NOTE: a null reverse path e-mail address used by MTA for delivery status
notifications (DSN) has empty local part and empty domain. As far as the
lookup is concerned (which uses raw, i.e. non-quoted and non-bracketed
address form), this address is @, i.e. a single character "@".
The SQL lookup for null address goes through the following sequence
of keys: "", "@", "@." (double quotes added for clarity, they are not part
of the key).

Table names and field names as used for SQL lookups are hard-wired in the
routine prepare_sql_queries(). Please adjust it to will. Field names should
be unique even without the table prefix. If they are not, the last one
in the SELECT field list prevails.

For an example schema that can be used with MySQL or PostgreSQL or SQLite
see README.sql.


Special handling of optional SQL field 'users.local'

A special shorthand is provided when SQL lookups are used: when a match
for recipient address (or domain) is found in SQL tables (regardless of
field values), the recipient is considered local, regardless of static
@local_comains_acl or %local_domains lookup tables. This simplifies
life when a large number of dynamically changing domains is hosted.
To overrule this behaviour, add an explicit boolean field 'local'
to table 'users' (missing field defaults to true, meaning record match
implies the recipient is local; a NULL field 'local' is not special,
it is interpreted as undef like other NULL fields, causing search
to continue into other lookup tables).

Since amavisd-new-20030616-p7:
changed the default value for local_domains_sql lookup for the catchall
key '@.' under conditions: when user record with key '@.' is present in
the database and a field 'local' is not present. Previously it surprisingly
defaulted to true, now it falls back to static lookup table defaults,
the same as if the record '@.' were not present in the table or as if
the field value 'local' was NULL.


Case sentitivity of string comparison

Amavisd-new expects string comparison to be case sensitive (but does not
mind if it isn't). When forming a SELECT clause it lowercases parts of
keys that are supposed to be case-insensitive, such as the domain name.
The local part of the e-mail address in SQL search keys is lowercased
if and only if the $localpart_is_case_sensitive variable is false (which
is a default).

This means that case-insensitive parts of e-mail addresses as kept in the
SQL database should be in lower case, otherwise match may fail, depending
on SQL server behaviour and the use of BINARY prefix in string data types.

Since MySQL version 3.23.0 it is possible to declare a data type of
a column as BINARY, forcing string comparision to be case sensitive,
as it is in PostgreSQL. This is only required for sites that want to treat
localpart as case-sentitive and have $localpart_is_case_sensitive true.