Parsing HTML with lxml

Question

I need help extracting some text from a page with lxml. I tried BeautifulSoup, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to lxml, but the docs are a little confusing, and I was hoping someone here could help me.

Here is the page I am trying to parse. I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on the site to parse, and each page's HTML is not always exactly the same (it might contain some extra empty "td" tags). Any suggestions on how to get at that text would be very much appreciated.

Thanks for the help.

python html parsing lxml

asked by RivieraKid on 17 August 2011 at 21:25

1 answer

score 15 · Accepted Answer

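import lxml.html as lh
from urllib.request import urlopen

def text_tail(node):
    # Text directly inside the node, then text that follows its closing tag.
    yield node.text
    yield node.tail

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urlopen(url))
blurb = []  # stays empty if no matching label cell is found
for elt in doc.iter('td'):
    label = elt.text_content()
    # Note the two spaces in the label string.
    if label.startswith('Additional  Info'):
        # Collect every non-empty text fragment from the cells after the
        # label, skipping fragments that are just a non-breaking space.
        blurb = [text for node in elt.itersiblings('td')
                 for subnode in node.iter()
                 for text in text_tail(subnode) if text and text != '\xa0']
        break
print('\n'.join(blurb))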

yields

For over 65 years, Carl Stirn's Marine has been setting new standards of excellence and service for boating enjoyment. Because we offer quality merchandise, caring, conscientious, sales and service, we have been able to make our customers our good friends.

Our 26,000 sq. ft. facility includes a complete parts and accessories department, full service department (Merc. Premier dealer with 2 full time Mercruiser Master Tech's), and new, used, and brokerage sales.
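For anyone wondering why the code above walks .text and .tail instead of calling text_content() on the sibling cells: in lxml, text that comes after a child's closing tag is stored on that child as its tail, not on the parent. Here is a small sketch on a made-up fragment (the markup is an assumption for illustration, not the page from the question):

```python
import lxml.html as lh

# Hypothetical markup, invented for this illustration.
html = ('<table><tr><td>label</td>'
        '<td>first<br>second<b>bold</b>third</td></tr></table>')
root = lh.fromstring(html)
cell = root.findall('.//td')[1]

def text_tail(node):
    # Same helper as in the answer: a node's own text, then its tail.
    yield node.text
    yield node.tail

# Walking every subnode keeps the fragments separate, in document order.
pieces = [t for sub in cell.iter() for t in text_tail(sub) if t]
print(pieces)

# text_content() returns the same characters, but glued into one string.
print(cell.text_content())
```

Keeping the fragments separate is what lets the answer join them with newlines, so a br in the page still ends up as a line break in the output.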

Edit: Here is an alternative solution based on Steven D. Majewski's xpath, which addresses the OP's comment that the number of tags separating "Additional Info" from the blurb may be unknown:

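import lxml.html as lh
from urllib.request import urlopen

url = 'http://bit.ly/bf1T12'
doc = lh.parse(urlopen(url))

# Find the td whose child element holds the label (note the two spaces),
# then take the text nodes of every td that follows it in the same row.
blurb = doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

# Drop cells that contain only a non-breaking space.
blurb = [text for text in blurb if text != '\xa0']
print('\n'.join(blurb))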
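If you want to try the xpath without hitting the network, something like the following should work. It is only a sketch: the SAMPLE markup is made up to imitate the layout described in the question (a label cell, a couple of empty filler cells, then the text), and the second expression using normalize-space() is a variation of mine, not part of the answer above.

```python
import lxml.html as lh

# Made-up markup imitating the layout described in the question:
# a label cell, some empty filler cells, then the cell holding the text.
SAMPLE = (
    '<html><body><table><tr>'
    '<td><b>Additional  Info</b></td>'
    '<td></td>'
    '<td>&nbsp;</td>'
    '<td>First paragraph.<br>Second paragraph.</td>'
    '</tr></table></body></html>'
)

doc = lh.document_fromstring(SAMPLE)

# Same expression as in the answer: exact match on the label text.
exact = doc.xpath(
    '//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

# Variation (an assumption, not from the answer): normalize-space()
# collapses runs of whitespace, so the label matches regardless of how
# many spaces it contains, and //text() also reaches text nested inside
# child tags of the following cells.
loose = doc.xpath(
    '//td[normalize-space(.)="Additional Info"]/following-sibling::td//text()')

def clean(texts):
    # Drop fragments that are just a non-breaking space.
    return [t for t in texts if t != '\xa0']

print('\n'.join(clean(exact)))
print('\n'.join(clean(loose)))
```

On this sample both expressions return the same two fragments. The looser one may be the safer choice when the pages differ slightly from each other, but check it against a few real pages before relying on it.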