I've got an HTML file that has some text that looks like this (after running it through lxml.html parse and lxml.html clean; this is the result of etree.tostring(table, pretty_print=True)):
224
9:00 am
-3:00 pm
NPHC Leadership
ALSO IN 223; WALL OPEN
The documentation that I've found on lxml has been somewhat spotty. I've been able to do quite a bit to get to this point, but what I would like to do is strip out all the tags except table, td, and tr. I would also like to strip all the attributes from those tags, and get rid of the HTML entities that remain in the text.
To strip the attributes, I currently use:
etree.strip_attributes(tree, 'width', 'href', 'style', 'onchange',
                       'ondblclick', 'class', 'colspan', 'cols',
                       'border', 'align', 'color', 'value',
                       'cellpadding', 'nowrap', 'selected',
                       'cellspacing')
which works fine, but it seems like there should be a better way. It seems like there should be some fairly simple methods to do what I want, but I haven't been able to find any examples that worked right for me.
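For illustration, one way to avoid listing every attribute name is to walk the tree and empty each element's attribute mapping. This is only a minimal sketch: the helper name strip_all_attributes and the sample markup are made up for the example, and it assumes every attribute should go, not just the ones listed above.

```python
import lxml.html


def strip_all_attributes(root):
    """Remove every attribute from every element under (and including) root."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str):
            el.attrib.clear()
    return root


# Hypothetical sample markup, not taken from the real schedule file.
sample = lxml.html.fromstring(
    '<table width="100%" class="grid">'
    '<tr><td align="left" colspan="2">224</td></tr>'
    '</table>'
)
strip_all_attributes(sample)
result = lxml.html.tostring(sample, encoding="unicode")
```

Because nothing is named explicitly, attributes that were not anticipated in the strip_attributes call are removed as well.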
I tried using Cleaner, but when I passed it allow_tags, like this:
Cleaner(allow_tags=['table', 'td', 'tr']).clean_html(tree)
it gave me this error:
ValueError: It does not make sense to pass in both allow_tags and remove_unknown_tags
Also, when I add remove_unknown_tags=False, I get this error:
Traceback (most recent call last):
  File "parse.py", line 73, in <module>
    SParser('schedule.html').test()
  File "parse.py", line 38, in __init__
    self.clean()
  File "parse.py", line 42, in clean
    Cleaner(allow_tags=['table', 'td', 'tr'], remove_unknown_tags=False).clean_html(tree)
  File "/usr/lib/python2.6/dist-packages/lxml/html/clean.py", line 488, in clean_html
    self(doc)
  File "/usr/lib/python2.6/dist-packages/lxml/html/clean.py", line 390, in __call__
    el.drop_tag()
  File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 191, in drop_tag
    assert parent is not None
AssertionError
So, to sum up:
- I want to remove HTML entities
- I want to remove all tags except table, td, and tr
- I want to remove all the attributes from the remaining tags.
Any help would be greatly appreciated!
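The three goals above could be sketched without Cleaner, using etree.strip_tags plus a pass over the remaining elements. This is a hedged sketch, not a confirmed solution: the helper name reduce_to_table_markup, the KEEP set, and the sample markup are assumptions for the example, and "removing entities" is interpreted here as collapsing whitespace-like characters (non-breaking spaces, carriage returns) that would otherwise be serialized as character references.

```python
import lxml.html
from lxml import etree

KEEP = {"table", "tr", "td"}  # tags to preserve (assumed from the question)


def normalize(text):
    """Collapse runs of whitespace, including NBSP and CR; return None if empty."""
    if text is None:
        return None
    return " ".join(text.split()) or None


def reduce_to_table_markup(root, keep=KEEP):
    # Collect the names of all element tags that are not in the keep set.
    unwanted = {
        el.tag for el in root.iter()
        if isinstance(el.tag, str) and el.tag not in keep
    }
    # strip_tags never removes the element it is called on, only descendants;
    # the text and children of stripped elements are merged into the parent.
    etree.strip_tags(root, *unwanted)
    for el in root.iter():
        if isinstance(el.tag, str):
            el.attrib.clear()              # drop every attribute
            el.text = normalize(el.text)   # tidy text inside the element
            el.tail = normalize(el.tail)   # tidy text following the element
    return root


# Hypothetical sample markup, loosely modelled on the schedule rows above.
doc = lxml.html.fromstring(
    '<div><table class="x"><tr>'
    '<td width="10"><b>224</b>&nbsp;</td>'
    '<td><a href="#">9:00 am</a>&#13;</td>'
    '</tr></table></div>'
)
table = doc.find(".//table")
reduce_to_table_markup(table)
cleaned = lxml.html.tostring(table, encoding="unicode")
```

Calling the helper on the table element itself keeps the table as the root, so no element ever needs to be dropped without a parent. Note that collapsing whitespace this way also removes line breaks inside cells, which may or may not be wanted.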
asked by Wayne Werner, 3 May 2011 at 20:01