Mailing List Archive



[tlug] Do you whitelist or blacklist utf-8?



TLUG,

I've been going a little mental today trying to figure out how to filter
out possibly malicious characters from POST data going to my site. I
want to block things like <, >, *, etc.

The thing is that I also want to be able to allow CJK characters, and
any other language with non-Latin characters. This is a snap to do if
you just want to allow 0-9a-zA-Z. But once you get into Unicode land, it
seems to be a whole other ballgame.

I've got three stages I want to filter on. First, I want to block
characters on the client side with JavaScript, so that the user is aware
of what characters are permissible when entering names and whatnot. Then
I want to block any bad characters on the server side in PHP, to make
sure no script kiddies have tried to POST anything nasty. And finally,
just for good measure, I want to ensure no nastiness is inserted into my
MySQL database.

I'd like all three steps to be consistent with each other, so I'm trying
to standardize a set of bad characters that I can filter for at each step.
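
As a rough illustration of the "standard set of bad characters" idea, a
minimal JavaScript sketch might look like the following. The names
BLOCKED_CHARS and findBlockedChars are hypothetical, and only <, > and *
come from the list above; the rest of the set would still have to be
decided. It reports which blocked characters were found, so the client
side could tell the user what is not permitted.

```javascript
// Hypothetical shared blacklist. "<", ">" and "*" come from the post;
// anything else in the set is for the site owner to decide.
const BLOCKED_CHARS = new Set(["<", ">", "*"]);

// Return the distinct blocked characters found in `input`, in order of
// first appearance. for...of walks the string by code point, so CJK
// text and characters outside the BMP are not split apart.
function findBlockedChars(input) {
  const found = [];
  for (const ch of input) {
    if (BLOCKED_CHARS.has(ch) && !found.includes(ch)) {
      found.push(ch);
    }
  }
  return found;
}

console.log(findBlockedChars("山田 <b>太郎</b>")); // [ '<', '>' ]
console.log(findBlockedChars("東京タワー"));        // []
```

A check like this only covers the characters that were thought of in
advance, which is exactly the weakness of a blacklist.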

However, where I've gotten stuck is deciding between three approaches:
blacklisting bad characters (where I fear I might miss one),
whitelisting good characters (it seems tough to get a whitelist that's
utf-8 compatible), or comparing against HTML entities or a regex using
built-in functions (PHP and JavaScript differ on the specific functions
and their results).
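
To make the whitelist option concrete, here is a minimal JavaScript
sketch, assuming an engine that supports Unicode property escapes
(\p{...} with the u flag, ES2018 and later). The names ALLOWED_NAME and
isAllowedName are hypothetical, and the choice of categories (letters,
combining marks, digits, space separators) is an assumption about what a
"name" field should accept, not a vetted policy.

```javascript
// Whitelist by Unicode general category instead of listing characters:
//   \p{L}  letters in any script (Latin, kana, CJK ideographs, ...)
//   \p{M}  combining marks (accents attached to a base letter)
//   \p{N}  digits and other numeric characters
//   \p{Zs} space separators (ASCII space, ideographic space, ...)
const ALLOWED_NAME = /^[\p{L}\p{M}\p{N}\p{Zs}]+$/u;

// True only if every character in `input` is in an allowed category.
// The empty string is rejected because the pattern requires at least
// one character.
function isAllowedName(input) {
  return ALLOWED_NAME.test(input);
}

console.log(isAllowedName("Dave M G"));  // true
console.log(isAllowedName("山田 太郎")); // true
console.log(isAllowedName("<script>"));  // false
console.log(isAllowedName("a*b"));       // false
```

Note that this particular pattern also rejects apostrophes and hyphens,
so a name like O'Brien would need those added explicitly. PHP's preg_*
functions accept the same \p{...} classes when the pattern uses the u
modifier, so a pattern along these lines could in principle be shared
between the client and server checks.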

Since you guys are the go-to people for handling utf-8 text, I thought
maybe you've encountered this before.

How do you handle filtering malicious code from utf-8 text that contains
CJK and other languages?

And how do you do it in PHP and Javascript?

-- 
Dave M G

