Monday, June 13, 2011

News Forum Rich Text Sanitizing with Bleach

(playdoh)haoqili@host-3-248:11:50:59:~/dev/playdoh/playdoh/playdoh$ ./ shell
Python 2.7.1 (r271:86832, Jun  6 2011, 13:57:48)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach
Traceback (most recent call last):
  File "<console>", line 1, in <module>
ImportError: No module named bleach
(playdoh)haoqili@host-3-248:12:00:06:~/dev/playdoh/playdoh/playdoh$ pip install -e git://
Obtaining bleach from git+git://
  Cloning git:// to /Users/haoqili/.virtualenvs/playdoh/src/bleach
  Running egg_info for package bleach
Downloading/unpacking html5lib (from bleach)
  Downloading (99Kb): 99Kb downloaded
  Running egg_info for package html5lib
Installing collected packages: bleach, html5lib
  Running develop for bleach
    Creating /Users/haoqili/.virtualenvs/playdoh/lib/python2.7/site-packages/bleach.egg-link (link to .)
    Adding bleach 1.0.2 to easy-install.pth file
    Installed /Users/haoqili/.virtualenvs/playdoh/src/bleach
  Running install for html5lib
Successfully installed bleach html5lib
Cleaning up...
(playdoh)haoqili@host-3-248:12:00:54:~/dev/playdoh/playdoh/playdoh$ ./ shell
Python 2.7.1 (r271:86832, Jun  6 2011, 13:57:48)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach
>>> bleach.clean('an <script>evil()</script> example')
u'an &lt;script&gt;evil()&lt;/script&gt; example'
>>> bleach.linkify('an url')
u'an <a href="" rel="nofollow"></a> url'

jsocol: the "u" indicates that it's a unicode string instead of a simple bytestring, it's a python thing

jsocol: haoqili: there's no Bleach() class anymore, it's just "import bleach" or "from bleach import clean, linkify"

haoqili: I'm looking at
[12:25pm] jsocol: ok
[12:25pm] haoqili: What is the significance of TLDS = """ac ad ae aero af ag ai al am an ao aq ar arpa as asia at au aw ax az
[12:25pm] haoqili: ba bb bd be bf bg bh bi biz bj bm bn bo br bs bt bv bw by bz ca cat?
[12:25pm] haoqili: i.e. what is TLDS?
[12:26pm] jsocol: TLD = Top Level Domain, it's supposed to be an exhaustive list of all the current, valid TLDs
[12:26pm] jsocol: so that, for example, "" or "" gets linkified, but "example.txt" does not
[12:26pm] haoqili: ah
[12:26pm] haoqili: okay
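
To make the TLD point concrete, here is a rough sketch of how a TLD list can decide whether a token looks like a domain. This is not Bleach's actual code: the shortened TLDS list, DOMAIN_RE, and looks_like_domain are hypothetical names made up for illustration, "example.com" is an assumed sample domain, and the sketch is written for Python 3.

```python
import re

# Tiny illustrative subset; the real TLDS list is meant to be exhaustive.
TLDS = "com org net biz cat".split()

# A dotted hostname whose final label is one of the known TLDs.
DOMAIN_RE = re.compile(
    r"[a-z0-9-]+(?:\.[a-z0-9-]+)*\.(?:%s)" % "|".join(TLDS),
    re.IGNORECASE,
)


def looks_like_domain(token):
    """Return True if the whole token is a hostname ending in a known TLD."""
    return DOMAIN_RE.fullmatch(token) is not None


print(looks_like_domain("example.com"))  # ends in a listed TLD
print(looks_like_domain("example.txt"))  # "txt" is not a TLD
```

With a check like this, a file name such as "example.txt" is left alone because "txt" is not in the list, which is the behavior jsocol describes.
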
[12:27pm] haoqili: I'm also trying to understand how clean's ALLOWED_TAGS get added to the pre-defined ALLOWED_TAGS
[12:28pm] haoqili: is it added or replaced?
[12:28pm] haoqili: I guess it's replaced?
[12:28pm] jsocol: replaced. if you pass in a tags= kwarg, your list supersedes the default list, so you can be more restrictive
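
The "replaced, not added" behavior can be illustrated with a toy function. This is only a sketch of the keyword-argument semantics, not Bleach's implementation and not a safe sanitizer: DEFAULT_TAGS, TAG_RE, and toy_clean are hypothetical names, the default list here is invented, and the code is written for Python 3.

```python
import html
import re

# Hypothetical default whitelist; NOT Bleach's real ALLOWED_TAGS.
DEFAULT_TAGS = ["a", "b", "em", "strong"]

# Matches an opening or closing tag and captures the tag name.
TAG_RE = re.compile(r"</?([a-zA-Z][a-zA-Z0-9]*)[^<>]*>")


def toy_clean(text, tags=DEFAULT_TAGS):
    """Escape every tag whose name is not in `tags`.

    A caller-supplied `tags` list replaces the default entirely;
    nothing from DEFAULT_TAGS is merged back in.
    """
    allowed = set(tags)

    def keep_or_escape(match):
        if match.group(1).lower() in allowed:
            return match.group(0)
        return html.escape(match.group(0))

    return TAG_RE.sub(keep_or_escape, text)


sample = "<b>bold</b> <em>hi</em>"
print(toy_clean(sample))              # both tags are in the default list
print(toy_clean(sample, tags=["b"]))  # <em> is now escaped
```

Because the passed list replaces the default, "em" stops being allowed as soon as tags=["b"] is given, even though it is in the default list. That is what makes it possible to be more restrictive than the defaults.
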

