LOOKUP MAPS (hash, SQL) AND ACCESS LISTS EXPLAINED ================================================== Updated: 2002-04, 2002-06, 2002-11, 2002-12, 2003-03, 2003-05, 2003-06, 2003-09, 2003-12, 2004-01, 2004-03, 2004-12, 2005-01, 2005-03, 2005-05, 2005-08, 2010-10 Mark Martinec (applies to the semantics of amavisd.conf variables such as: %virus_lovers, %bypass_checks, @virus_lovers_acl, @bypass_checks_acl, $virus_lovers_re, $bypass_checks_re, %local_domains, @local_domains_acl, %mailto, ... ) NOTE: All lookups are performed with raw (rfc2821-unquoted and unbracketed) addresses as a key, i.e.: Bob "Funny" Dude@example.com not: "Bob \"Funny\" Dude"@example.com and not: <"Bob \"Funny\" Dude"@example.com> Several configurable settings in amavisd are controlled through the use of table lookups (hash/associative array), access control lists (array), Perl-regexp -based access control lists, SQL or LDAP lookups. The subroutine that does all the lookups is: sub lookup($$@) { my($get_all, $addr, @tables) = @_; It perform a lookup for a key (usually a recipient e-mail envelope address, unless otherwise noted) against one or more lookup tables / maps. The set of maps used to be hard-wired into the program (but no longer is), and the order chosen is: from specific to more general, and from faster to slower, which is usually flexible enough. Thus the default sequence of lookups: SQL, LDAP, hash, ACL, regexp, constant. The first that returns a definitive answer (not undef/NULL) stops the search. The SQL and LDAP are somewhat specific and are always consulted first. There can only be one (or none) SQL and one (or none) LDAP lookup. This is an implementational limitation, and might be lifted some day. The lists of static lookup tables are configurable since 20040701 (amavisd-new-2.0), and is controlled by array variables such as: @virus_admin_maps = (\%virus_admin, \$virus_admin); @viruses_that_fake_sender_maps = (\$viruses_that_fake_sender_re, 1); @spam_kill_level_maps = (\$sa_kill_level_deflt); @local_domains_maps = (\%local_domains, \@local_domains_acl, \$local_domains_re); @bypass_virus_checks_maps = (\%bypass_virus_checks, \@bypass_virus_checks_acl, \$bypass_virus_checks_re); @virus_lovers_maps = (\%virus_lovers, \@virus_lovers_acl, \$virus_lovers_re); See amavisd.conf-default for a complete list of these @*_maps variables. The above example shows that the default value of these arrays exactly corresponds to the formerly hard-wired search order. Users are free to leave these @*_maps variables at their default, referencing the legacy variables, or the list can be replaced entirely. There may be any number of lookup tables of any static type specified in these lists. Some restrain is warranted nevertheless for efficiency reasons - one lookup into larger lookup table is ofter quicker than two lookups into smaller ones. Some lookup maps can only return boolean result (e.g. ACL), other maps may return any value, which can be interpreted as boolean, numeric, string or possibly other. The result of some lookup maps (e.g. regexp) may include pieces of lookup key. If a match is found, the subroutine lookup() returns whatever the map returns; undef is returned if nothing matches (which for Perl is false as well). A CONSTANT Specifying a Perl scalar as an argument to lookup() is a degenerate case of a lookup table: it matches any key, and the value of the scalar is returned as the match value. Specifying a scalar argument in a call to lookup() (e.g. as the last element in @*_maps arrays) is useful as a last-resort (catchall, default) value. One level of indirection is alowed, so the following three cases are equivalent: $sa_kill_level_deflt = 6.0; @spam_kill_level_maps = (\%some_hash, \$sa_kill_level_deflt); and: $sa_kill_level_deflt = 6.0; @spam_kill_level_maps = (\%some_hash, $sa_kill_level_deflt); and: @spam_kill_level_maps = (\%some_hash, 6.0); The first case allows for the value of a scalar variable to be assigned even _after_ the assignment to @*_maps, so this still works as expected: @spam_kill_level_maps = (\%some_hash, \$sa_kill_level_deflt); $sa_kill_level_deflt = 6.0; but the following does not (it uses a value in the scalar variable at the time of assignment to the list, which is most likely not 6.0): @spam_kill_level_maps = (\%some_hash, $sa_kill_level_deflt); $sa_kill_level_deflt = 6.0; HASH LOOKUPS (associative array lookups) For arguments to subroutine lookup() of type hash-ref, the argument is passed to subroutine lookup_hash(), which does a lookup into a Perl hash. Hash lookups (e.g. for User+foo@Sub.Example.com) are performed in the following order: - lookup for user+foo@sub.example.com - lookup for user@sub.example.com (only if $recipient_delimiter is '+') - lookup for user+foo@ - lookup for user@ (only if $recipient_delimiter is '+') - lookup for sub.example.com - lookup for .sub.example.com - lookup for .example.com - lookup for .com - lookup for . The search sequence stops as soon as a match is found, and the value of the matched entry determines the result. The domain part of a key is always lowercased, the localpart is lowercase when $localpart_is_case_sensitive is true (not case-sensitive by default). The keys in a hash should match the case of a key, i.e. should typically be entirely in lowercase. A field value undef implies that the next lookup table (if there are more) is to be tried. In plain words, undef means "this table does not know the answer, try the next one". Further searching in this table (for possibly more general defaults) is terminated. NOTE: a null reverse path e-mail address used by MTA for delivery status notifications (DSN) has empty local part and empty domain. As far as the lookup is concerned (which uses raw, i.e. non-quoted and non-bracketed address form), this address is @, i.e. a single character "@". The lookup_hash for null address goes through the following sequence of keys: "", "@", "." (double quotes added for clarity, they are not part of the key). There is a subroutine read_hash() available for use in amavisd.conf. It can read keys from a plain text file, and load them into a Perl hash. Format of the text file: one address per line, anything from '#' to the end of line is treated as a comment, but '#' within correctly quoted rfc2821 addresses is not treated as a comment (e.g. a hash sign within "strange # \"foo\" address"@example.com is valid). Leading and trailing whitespace is discarded, empty lines (containing only whitespace and comment) are ignored. Addresses are converted from quoted form into internal (raw) form and inserted as keys into a given hash, with a value of 1 (true). Each address can have an associated optional value (also known as the 'righthand side' or RHS) separated from the address by whitespace. An absence of a value implies 1 (true). The $hashref argument is returned for convenience, so that one can say for example: $per_recip_whitelist_sender_lookup_tables = { '.my1.example.com' => read_hash({},'/var/amavis/my1-example-com.wl'), '.my2.example.com' => read_hash({},'/var/amavis/my2-example-com.wl') } LIST LOOKUPS (ACL) For arguments to subroutine lookup() of type array-ref, the argument is passed to subroutine lookup_acl(), which does an access list lookup: sub lookup_acl($$) { my($addr, $acl_ref) = @_; The supplied e-mail address is compared with each member of the lookup list in turn, the first match wins (terminates the search), and its value decides whether the result is true (yes, permit, pass) or false (no, deny, drop). Falling through without a match produces false (undef). Search is case-insensitive. NOTE: lookup_acl is not aware of address extensions and they are not handled specially! If a list element contains a '@', the full e-mail address is compared, otherwise if a list element has a leading dot, the domain name part is matched only, and the domain as well as its subdomains can match. If there is no leading dot, the domain must match exactly (subdomains do not match). The presence of character '!' prepended to a list element decides whether the result will be true (without a '!') or false (with '!') in case this list element matches and terminates the search. Because search stops at the first match, it only makes sense to place more specific patterns before the more general ones. Although not a special case, it is good to remember that '.' always matches, so a '.' would stop the search and return true, whereas '!.' would stop the search and return false (0). Examples: given: @acl = qw( me.ac.uk !.ac.uk .uk ) 'u@me.ac.uk' matches me.ac.uk, returns true and search stops given: @acl = qw( me.ac.uk !.ac.uk .uk ) 'u@you.ac.uk' matches .ac.uk, returns false (because of '!'), search stops given: @acl = qw( me.ac.uk !.ac.uk .uk ) 'u@them.co.uk' matches .uk, returns true and search stops given: @acl = qw( me.ac.uk !.ac.uk .uk ) 'u@some.com' does not match anything, falls through and returns undef given: @acl = qw( me.ac.uk !.ac.uk .uk !. ) 'u@some.com' similar to the previous, except it returns 0 instead of undef, which would only make a difference if this ACL is not the last argument in a call to lookup() given: @acl = qw( me.ac.uk !.ac.uk .uk . ) 'u@some.com' matches catchall ".", and returns true more complex example: @acl = qw( !The.Boss@dept1.xxx.com .dept1.xxx.com .dept2.xxx.com .dept3.xxx.com lab.dept4.xxx.com sub.xxx.com !.sub.xxx.com me.d.aaa.com him.d.aaa.com !.d.aaa.com .aaa.com ); Comparing hash (associative array) and ACL: For smaller sets of keys and if only boolean results are needed, both hash and ACL are appropriate. - hash is still effective for lots of keys, ACL search is linear; - hash can return any value, not just true or false; - hash can strip away address extension, ACL can not; - ACL appears simpler and more obvious for smaller sets; - ACL can accommodate arbitrarily nested if-then-elseif-then-...-else cases whereas hash only follows a fixed order of stripping addresses; ACL FOR IP ADDRESSES A special type of lookup is an IP-matching access list implemented by lookup_ip_acl(). It performs a lookup for an IP address against a list or an asssociative array (a hash) of IPv4 or IPv6 networks. It is used by amavisd for example to check if the SMTP client (normally your MTA) is allowed to connect, which is why it is sometimes called 'access control list' or ACL (the variable is @inet_acl). IP address is compared with each member of an access list in turn, the first match wins (terminates the search), and its value decides whether the result is true (yes, permit) or false (no, deny). Falling through without a match produces false (undef). The presence of character '!' prepended to a list member decides whether the result will be true (without a '!') or false (with '!') in case this list member matches and terminates the search. Because search stops at the first match, it only makes sense to place more specific patterns before the more general ones. Network can be specified in classless notation n.n.n.n/k, or using a mask n.n.n.n/m.m.m.m . Missing mask implies /32, i.e. a host address. Although not a special case, it is good to remember that '::/0' always matches any IPv4 or IPv6 address (even syntactically invalid address). The '0/0' is equivalent to '::FFFF:0:0/96' and matches any syntactically valid IPv4 address (including IPv4-mapped IPv6 addresses), but not other IPv6 addresses! Example given: @acl = qw( !192.168.1.12 172.16.3.3 !172.16.3.0/255.255.255.0 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16 !0.0.0.0/8 !:: 127.0.0.0/8 ::1 ); matches rfc1918 private address space except host 192.168.1.12 and net 172.16.3/24 (but host 172.16.3.3 within 172.16.3/24 still matches). In addition, the 'unspecified' (null, i.e. all zeros) IPv4 and IPv6 addresses return false, and IPv4 and IPv6 loopback addresses match and return true. If the supplied lookup table is a hash reference, match a canonical IP address: dot-quad IPv4, or preferred IPv6 form, against hash keys. For IPv4 addresses a simple classful subnet specification is allowed in hash keys by truncating trailing bytes from the looked up IPv4 address. A syntactically invalid IP address can only match a hash entry with an undef key. Besides looking up full CIDR-style IPv4 or IPv6 lists, later versions of lookup_ip_acl() also make possible matching against a hash lookup table, which only allows for full addresses in canonical form (dotted-quad IPv4 addresses without leading zeroes or IPv6 addresses in canonical preferred form: x:x:x:x:x:x:x:x), or classful IPv4 subnets with truncated octets, such as: ('10.11.12.13'=>1, '192.168.1.2'=>0, '192.168'=>1, '127'=>1, '10'=>1) A convenient method of reading CIDR-style IP lists or a hash from a file if by calling provided routines read_array or read_hash, e.g.: @mynetworks_maps = (read_array('/etc/amavisd-mynetworks'), \@mynetworks); or: @mynetworks_maps = (read_hash('/etc/amavisd-mynetworks'), \@mynetworks); REGULAR EXPRESSION LOOKUPS For arguments to subroutine lookup() of type Amavis::Lookup::RE (objects), the object is passed to method lookup_re, which does a lookup into a list of Perl regular expressions (regexp or RE for short). The full unmodified e-mail address is always used, so splitting to localpart and domain or lowercasing is NOT performed. The regexp is powerful enough that this is unnecessary. The routine is useful for other RE tests, such as looking for banned file names. Each element of the list can be a ref to a pair, or directly a regexp ('Regexp' object created by qr operator, or just a (less efficient) string containing a regular expression). If it is a pair, the first element is treated as a regexp, and the second provides a return value in case the regexp matches. If not a pair, the implied result value of a match is 1. The regular expression is taken as-is, no implicit anchoring or setting case insensitivity is done, so do use a qr'(?i)^user@example\.com$', and not a sloppy qr'user@example.com', which can easily backfire. Also, if qr is used with a delimiter other than ' (apostrophe), make sure to quote the @ and $ . The pattern allows for capturing of parenthesized substrings, which can then be referenced from the result string using the $1, $2, ... notation, as with the Perl m// operator. The number after a $ may be a multi-digit decimal number. To avoid possible ambiguity the ${n} or $(n) form may be used. Substring numbering starts with 1. Nonexistent references evaluate to empty strings. If any substitution is done, the result inherits the taintedness of the key. Keep in mind that $ and @ characters needs to be backslash-quoted in qq() strings. Example: $virus_quarantine_to = new_RE( [ qr'^(.*)@example\.com$'i => 'virus-${1}@example.com' ], [ qr'^(.*)(@[^@]*)?$'i => 'virus-${1}${2}' ] ); Example (equivalent to the example in lookup_acl): $acl_re = new_re->new( qr'@me\.ac\.uk$'i, [ qr'[@.]ac\.uk$'i => 0 ], qr'\.uk$'i, ); ($r,$k) = $acl_re->lookup_re('user@me.ac.uk'); or $r = lookup('user@me.ac.uk', $acl_re); 'user@me.ac.uk' matches me.ac.uk, returns true and search stops 'user@you.ac.uk' matches .ac.uk, returns false (because of =>0) and search stops 'user@them.co.uk' matches .uk, returns true and search stops 'user@some.com' does not match anything, falls through and returns false (undef) NOTE: new_RE is a synonym (shorthand) for the internal subroutine Amavis::Lookup::RE::new See Perl documentation (or Google the Internet) for the description of Perl regular expressions. They are just enhanced version of Posix regular expressions, i.e. what egrep, awk and sed thrive on. Here are the most important constructs (simplified): . Match any character inter..t | Alternation alfa|beta|gamma () Grouping (pre|post)fix [] Set of characters (char. class) [Aa]lfa[0-9] ^ Match the beginning of the string ^MakeMoney $ Match the end of the string com$ \ Quote the next metacharacter \.com$ ^\$\$\$\+spam@\[127\.0\.0\.1\]$ most other characters just match themselves quantifiers may be placed after the pattern to modify its meaning from 'match itself exactly once' into: * Match 0 or more times ^alfa.*omega$ + Match 1 or more times alfa +beta ? Match 1 or 0 times (first)?aid {n} Match exactly n times 0{6} {n,} Match at least n times !{3,} {n,m} Match at least n but not more than m times SQL LOOKUPS For general SQL considerations and the interpretation of @lookup_sql_dsn please see documentation in README.sql . The amavisd.conf variable @lookup_sql_dsn controls access to the SQL server (dsn = data source name). If the list @lookup_sql_dsn is empty, no attempts to use SQL for lookups will be made, and no code to use DBI will be loaded or compiled (if @storage_sql_dsn is empty as well). For arguments to subroutine lookup() of type Amavis::Lookup::SQLfield (objects), the object is passed to a method lookup_sql_field, which does a lookup into an SQL table by using Perl module DBI. An SQL 'select' requests all available fields from the specified tables, and the result is cached (just for this mail message processing). Individual fields can be extracted one at a time from this cache very quickly, so there is no penalty in using several calls to lookup for different fields (for the same key) in different parts of the program. lookup_sql() performs a lookup for an e-mail address against an SQL map. If a match is found it returns whatever the map returns (a reference to a hash containing values of requested fields), otherwise returns undef. A match aborts further fetching sequence. lookup_sql_field() also performs a lookup for an e-mail address against an SQL map. It first calls lookup_sql() if it hasn't been called yet for this key, requesting it to return all matching records. Instead of returning the whole record as lookup_sql does, it returns just a value of one particular table field, the first one with a defined (non-NULL) value from the list of matching records (or undef if there are none). The lookup_sql_field() is the subroutine that gets called from lookup() for arguments (objects) of type Amavis::Lookup::SQLfield. A field value NULL is translated to a Perl undef, and is treated the same as a nonexistent field of the specified name. For any field (apart from the only exception field 'local') a record with a NULL/undef/nonexistent field is skipped and search continues with the next, more general, matching record until a defined value is found or all matching records are checked. The only exception is a field named 'local', where a nonexistent field on an otherwise matching record implies a value true - see section "Special handling of optional SQL field 'users.local'" further down. Boolean fields are usually represented as a single character (instead of an integer) to minimize storage. Characters N,n,F,f,0,NUL and SPACE represent false (0), any other character represents true. Trailing blanks are ignored. It is customary to use Y for true and N for false. SQL lookups (e.g. for user+foo@example.com) are performed in order which is usually specified by 'ORDER BY...DESC' in the SELECT statement; otherwise the order is unspecified, which is only useful if just specific entries exist in a database (e.g. full address always, not domain part only or mailbox parts only). The following order (implemented by sorting on the 'priority' field in DESCending order, zero is low priority) is recommended, to follow the same specific-to-general principle as in other lookup tables; the first column is a suggested priority (the exact value does not matter as long as the order is maintained): 9 - lookup for user+foo@sub.example.com 8 - lookup for user@sub.example.com (only if $recipient_delimiter is '+') 7 - lookup for user+foo (only if domain part is local) 6 - lookup for user (only local; only if $recipient_delimiter is '+') 5 - lookup for @sub.example.com 3 - lookup for @.sub.example.com 2 - lookup for @.example.com 1 - lookup for @.com 0 - lookup for @. (catchall) NOTE: this is different from hash and ACL lookups in two important aspects: - key without '@' implies a mailbox name, not a domain name; - naked mailbox name lookups (without an '@', e.g. 'user') are only performed when the mail address matches local_domains lookups. The domain part is always lowercased when constructing a key, the localpart is not lowercased when $localpart_is_case_sensitive is true. With SQL datatypes CHAR and VARCHAR this does not matter anyway, as such fields are matched case-insensitively. NOTE: a null reverse path e-mail address used by MTA for delivery status notifications (DSN) has empty local part and empty domain. As far as the lookup is concerned (which uses raw, i.e. non-quoted and non-bracketed address form), this address is @, i.e. a single character "@". The SQL lookup for null address goes through the following sequence of keys: "", "@", "@." (double quotes added for clarity, they are not part of the key). Table names and field names as used for SQL lookups are hard-wired in the routine prepare_sql_queries(). Please adjust it to will. Field names should be unique even without the table prefix. If they are not, the last one in the SELECT field list prevails. For an example schema that can be used with MySQL or PostgreSQL or SQLite see README.sql. Special handling of optional SQL field 'users.local' A special shorthand is provided when SQL lookups are used: when a match for recipient address (or domain) is found in SQL tables (regardless of field values), the recipient is considered local, regardless of static @local_comains_acl or %local_domains lookup tables. This simplifies life when a large number of dynamically changing domains is hosted. To overrule this behaviour, add an explicit boolean field 'local' to table 'users' (missing field defaults to true, meaning record match implies the recipient is local; a NULL field 'local' is not special, it is interpreted as undef like other NULL fields, causing search to continue into other lookup tables). Since amavisd-new-20030616-p7: changed the default value for local_domains_sql lookup for the catchall key '@.' under conditions: when user record with key '@.' is present in the database and a field 'local' is not present. Previously it surprisingly defaulted to true, now it falls back to static lookup table defaults, the same as if the record '@.' were not present in the table or as if the field value 'local' was NULL. Case sentitivity of string comparison Amavisd-new expects string comparison to be case sensitive (but does not mind if it isn't). When forming a SELECT clause it lowercases parts of keys that are supposed to be case-insensitive, such as the domain name. The local part of the e-mail address in SQL search keys is lowercased if and only if the $localpart_is_case_sensitive variable is false (which is a default). This means that case-insensitive parts of e-mail addresses as kept in the SQL database should be in lower case, otherwise match may fail, depending on SQL server behaviour and the use of BINARY prefix in string data types. Since MySQL version 3.23.0 it is possible to declare a data type of a column as BINARY, forcing string comparision to be case sensitive, as it is in PostgreSQL. This is only required for sites that want to treat localpart as case-sentitive and have $localpart_is_case_sensitive true.