BeautifulSoup and multiple paragraphs

Question

I am trying to scrape a speech from a website using BeautifulSoup. I am running into problems because the speech is split into many separate paragraphs. I am new to programming and am having trouble working out how to deal with this. The HTML of the page looks like this:

<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney, 
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is    
at war; our economy is in recession; and the civilized world faces unprecedented dangers. 
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, 
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and  
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, 
saved a people from starvation, and freed a country from brutal oppression. 
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied 
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to 
sacrifice their lives are running for their own.

It continues like this for a while, with more paragraph tags. I am trying to extract all of the text inside the span.

I have tried a couple of different ways to get the text, but neither returned the text I want.

The first thing I tried was:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string

which gives me:

Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.

This is the part of the text before the first paragraph tag. Then I tried:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
     paragraph = section.findNext('p')
     if paragraph and paragraph.string:
         print '>', paragraph.string
     else:
         print '>', section.parent.next.next.strip()

This gave me the text between the first paragraph tag and the second paragraph tag. So I am looking for a way to get the entire text, not just individual sections.

python beautifulsoup web-scraping

asked by user1074057 on 30 November 2011 at 21:18
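
One possible way to collect the whole speech is to ask for every piece of text nested under the span, rather than only its first string. The sketch below is not taken from the question. It assumes Python 3 and the newer `bs4` package (Beautiful Soup 4) instead of the `BeautifulSoup` 3 import used above, and it parses a short inline sample (`SAMPLE_HTML`, an abridged stand-in for the real page) instead of fetching the URL. The helper name `extract_span_text` is hypothetical. In Beautiful Soup 3, a roughly equivalent idea would be joining the result of `thespan.findAll(text=True)`.

```python
from bs4 import BeautifulSoup

# Abridged stand-in for the real page (assumption: same structure,
# a span with class "displaytext" containing unclosed <p> tags).
SAMPLE_HTML = """
<html><body>
<div>Navigation text outside the speech</div>
<span class="displaytext">Thank you very much.
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering.
<p>The American flag flies again over our Embassy in Kabul.
</span>
</body></html>
"""


def extract_span_text(html, css_class="displaytext"):
    """Return all text nested inside the first matching span, or None."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", class_=css_class)
    if span is None:
        return None
    # get_text() walks every descendant string, so text inside the
    # nested <p> tags is included; split/join collapses line breaks
    # and repeated spaces into single spaces.
    return " ".join(span.get_text(" ").split())


if __name__ == "__main__":
    print(extract_span_text(SAMPLE_HTML))
```

With the real page, the `html` string read from the URL could be passed to the same function. How the unclosed `<p>` tags are nested depends on the parser chosen, but their text remains inside the span in each case.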
