Re: Non English Spam



Beech Rintoul wrote:
I'm getting a ton of spam every day that comes from China, Japan and Korea. SpamAssassin completely ignores it because it is all non-English characters, and it slows KMail to a crawl while loading. Is there a way to filter on non-English text using either SpamAssassin or procmail?

I get none since adding a few simple header filter rules to Postfix:

# Accepted MIME character sets (ASCII, UTF-8 and ISO-8859-X)
/^Content-Type:.*?charset\s*=\s*"?(us-ascii|iso-8859-\d+|utf-8)"?/
OK HDR2000 Accepted charset: $1

Strictly speaking, you could reject every other character set, but I chose to make the rejections explicit:

# Reject specific character sets
# Chinese, Japanese and Korean
/^Content-Type:.*?charset\s*=\s*"?(Big5|gb2312|euc-cn)"?/
REJECT HDR2100: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(euc-kr|iso-2022-kr)"?/
REJECT HDR2110: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(iso-2022-\w+|euc-jp|shift_jis)"?/
REJECT HDR2120: Unaccepted character set: "$1"
# Cyrillic character sets (Russian/Ukrainian), plus windows-1250 (Central European)
/^Content-Type:.*?charset\s*=\s*"?(koi8-(?:r|u))"?/
REJECT HDR2200: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(windows-(?:1250|1251))"?/
REJECT HDR2210: Unaccepted character set: "$1"

And then you may want a catch-all rule for any character set not matched above:

/^Content-Type:.*?charset\s*=\s*"?([\w-]+)"?/
WARN HDR2299: Unknown character set: "$1"

You may change WARN to REJECT.
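
If you want to sanity-check the rule order before touching a live server, something like the following may help. It is only a rough sketch in Python, not how Postfix itself evaluates the table: the name classify and the RULES list are made up for illustration, and re.IGNORECASE is used on the assumption that Postfix pcre tables match case-insensitively by default. The first matching rule wins, as in the table.

```python
import re

# Ordered (pattern, action) pairs mirroring the rules above.
# Assumption: Postfix pcre tables are case-insensitive by default,
# which re.IGNORECASE imitates here.
PREFIX = r'^Content-Type:.*?charset\s*=\s*"?'
RULES = [
    (PREFIX + r'(us-ascii|iso-8859-\d+|utf-8)"?', "OK"),
    (PREFIX + r'(Big5|gb2312|euc-cn)"?', "REJECT"),
    (PREFIX + r'(euc-kr|iso-2022-kr)"?', "REJECT"),
    (PREFIX + r'(iso-2022-\w+|euc-jp|shift_jis)"?', "REJECT"),
    (PREFIX + r'(koi8-(?:r|u))"?', "REJECT"),
    (PREFIX + r'(windows-(?:1250|1251))"?', "REJECT"),
    # Catch-all, written with [\w-]+ so the whole charset name is captured.
    (PREFIX + r'([\w-]+)"?', "WARN"),
]


def classify(header):
    """Return (action, charset) for the first matching rule, else None."""
    for pattern, action in RULES:
        match = re.search(pattern, header, re.IGNORECASE)
        if match:
            return action, match.group(1)
    return None


if __name__ == "__main__":
    for line in (
        'Content-Type: text/plain; charset=utf-8',
        'Content-Type: text/plain; charset="Big5"',
        'Content-Type: text/html; charset=windows-1252',
    ):
        print(line, "->", classify(line))
```

A header without any charset parameter matches none of the rules, so classify returns None for it, which corresponds to the header passing through untouched.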

I have noticed, however, that some subscribers to this list write English encoded in one of the above character sets. I don't know enough about the character set definitions, but it seems that the English characters are a subset of each of them?
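
A quick check seems to support that, at least for plain letters and simple punctuation. The sketch below is in Python; the codec names are Python's own spellings, which I am assuming correspond to the MIME names used in the rules, and same_as_ascii is just a helper name made up here. The iso-2022-* encodings are left out because they are stateful and can add escape sequences.

```python
# Codec names are Python's spellings, assumed to correspond to the
# MIME charset names used in the rules above.
CODECS = ["big5", "gb2312", "euc_kr", "euc_jp", "shift_jis",
          "koi8_r", "koi8_u", "cp1250", "cp1251"]


def same_as_ascii(text, codec):
    """True if `text` encodes to the same bytes in `codec` as in ASCII."""
    return text.encode(codec) == text.encode("ascii")


sample = "Hello, list"
# Map each codec name to whether the sample is byte-identical to ASCII.
results = {codec: same_as_ascii(sample, codec) for codec in CODECS}

if __name__ == "__main__":
    for codec, identical in results.items():
        print(codec, identical)
```

So a message written in plain English but labelled with one of these character sets would look the same on the wire as ASCII, yet would still be rejected by the rules because only the declared charset is inspected.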

What is the recommended policy here? Should subscribers be advised to change character set when posting to the list?

Cheers, Erik
--
Ph: +34.666334818 web: http://www.locolomo.org
X.509 Certificate: http://www.locolomo.org/crt/8D03551FFCE04F0C.crt
Key ID: 69:79:B8:2C:E3:8F:E7:BE:5D:C3:C3:B1:74:62:B8:3F:9F:1F:69:B9