Scrapy (Python) doesn't seem to get data from all available URLs

I'm trying to scrape thesession.org to build a table showing how many times each tune has been added to members' tunebooks, so that I can find some popular tunes to learn. I started from the Scrapy tutorial and am trying to modify it for my purposes. The problem is that, although thesession.org seems to have about 10,390 tunes, my scraper only returns data on 10 of them (just the ones listed on http://www.thesession.org/tunes/index.php). How can I get data on all of the tunes (or at least the top hundred)? Any advice would be greatly appreciated.

Here's what I have so far:

items.py

from scrapy.item import Item, Field

class tuneItem(Item):
    url = Field()
    name1 = Field()
    name2 = Field()
    key = Field()
    count = Field()

tune_spider.py

from scrapy.spider import BaseSpider
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from tutorial.items import tuneItem
from scrapy.conf import settings

class tunesSpider(CrawlSpider):

    name = "irishtunes"
    allowed_domains = ["thesession.org"]
    start_urls = ["http://www.thesession.org/tunes"]
    rules = [Rule(SgmlLinkExtractor(allow=['/display/\d+'], deny=['/members/','/recordings/','/index/','/display/\d+/.']), 'parse_tune')]

    def parse_tune(self, response):
        x = HtmlXPathSelector(response)

        tune = tuneItem()
        tune['url'] = response.url
        tune['name1'] = x.select("//div[@id='details']//div[@class='box']/h1/text()").extract()
        tune['name2'] = x.select("//div[@id='details']//div[@class='box']/h2/text()").extract()
        tune['key']   = x.select("//div[@id='details']//div[@class='box']/p[1]/text()").extract()
        tune['count'] = x.select("//div[@id='details']//div[@class='box']/p[3]/text()").re('\d+')
        return tune
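
To see which links the rule above would pick up, the allow/deny patterns can be tried against a few URLs outside Scrapy. This is only a sketch: it assumes the link extractor applies each pattern with regex search semantics, the helper name `would_extract` is made up for illustration, and the `/comments` URL is a hypothetical sub-page, not one taken from the site. It is written for Python 3, unlike the Python 2 spider above.

```python
import re

# Patterns copied from the spider's rule above.
ALLOW = [r'/display/\d+']
DENY = [r'/members/', r'/recordings/', r'/index/', r'/display/\d+/.']

def would_extract(url, allow=ALLOW, deny=DENY):
    """Return True if the URL matches an allow pattern and no deny pattern.

    Assumption: patterns are applied with re.search, as a stand-in for
    what the link extractor does internally.
    """
    allowed = any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied

urls = [
    "http://www.thesession.org/tunes/display/11602",           # tune page from the log
    "http://www.thesession.org/tunes/display/11602/comments",  # hypothetical sub-page
    "http://www.thesession.org/tunes/index.php",               # listing page from the question
]
for url in urls:
    print(url, would_extract(url))
```

Under these assumptions only the tune page itself passes the filter; the listing page does not match the allow pattern, so this rule alone never selects it. Separately, the Scrapy documentation states that a Rule given a callback defaults to follow=False, which would be worth checking here.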

I run the spider by opening a console, changing to the directory that contains the tutorial's cfg file, and running scrapy crawl irishtunes --set FEED_URI=scraped_data.csv --set FEED_FORMAT=csv

Here's what I get:

C:\Users\BM\Desktop\scrape\tutorial>scrapy crawl irishtunes --set FEED_URI=scrap
ed_data.csv --set FEED_FORMAT=csv
2011-11-25 22:45:47-0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: tutoria
l)
2011-11-25 22:45:47-0800 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogSt
ats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi
ddleware, ChunkedTransferMiddleware, DownloaderStats
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle
ware
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Enabled item pipelines:
2011-11-25 22:45:48-0800 [irishtunes] INFO: Spider opened
2011-11-25 22:45:48-0800 [irishtunes] INFO: Crawled 0 pages (at 0 pages/min), sc
raped 0 items (at 0 items/min)
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602
3
2011-11-25 22:45:48-0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Redirecting (301) to  from 
2011-11-25 22:45:48-0800 [irishtunes] DEBUG: Crawled (200)  (referer: None)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11602>
        {'count': [u'1'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Brendan Begley's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11602'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11593>
        {'count': [u'3'],
         'key': [u'Key signature: Amajor'],
         'name1': [u'Carleton County Breakdown'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11593'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11597>
        {'count': [u'3'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Kasper's Rant"],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11597'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11594>
        {'count': [u'5'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'The Full Of The Bag'],
         'name2': [u'hornpipe'],
         'url': 'http://www.thesession.org/tunes/display/11594'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11599>
        {'count': [u'1'],
         'key': [u'Key signature: Adorian'],
         'name1': [u'The New Steamboat'],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11599'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11598>
        {'count': [u'4'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u"Galen's Arrival"],
         'name2': [u'reel'],
         'url': 'http://www.thesession.org/tunes/display/11598'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11596>
        {'count': [u'2'],
         'key': [u'Key signature: Amixolydian'],
         'name1': [u'Culloden Day'],
         'name2': [u'strathspey'],
         'url': 'http://www.thesession.org/tunes/display/11596'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11595>
        {'count': [u'2'],
         'key': [u'Key signature: Aminor'],
         'name1': [u'Miss Sine Flemington'],
         'name2': [u'barndance'],
         'url': 'http://www.thesession.org/tunes/display/11595'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11600>
        {'count': [u'2'],
         'key': [u'Key signature: Dmajor'],
         'name1': [u"Joan Martin's"],
         'name2': [u'polka'],
         'url': 'http://www.thesession.org/tunes/display/11600'}
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Crawled (200)  (referer: http://www.thesession.org/tunes/)
2011-11-25 22:45:49-0800 [irishtunes] DEBUG: Scraped from <200 http://www.theses
sion.org/tunes/display/11601>
        {'count': [u'2'],
         'key': [u'Key signature: Gmajor'],
         'name1': [u'My Time Inside 2005'],
         'name2': [u'waltz'],
         'url': 'http://www.thesession.org/tunes/display/11601'}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Closing spider (finished)
2011-11-25 22:45:49-0800 [irishtunes] INFO: Stored csv feed (10 items) in: scrap
ed_data.csv
2011-11-25 22:45:49-0800 [irishtunes] INFO: Dumping spider stats:
        {'downloader/request_bytes': 3655,
         'downloader/request_count': 12,
         'downloader/request_method_count/GET': 12,
         'downloader/response_bytes': 31620,
         'downloader/response_count': 12,
         'downloader/response_status_count/200': 11,
         'downloader/response_status_count/301': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2011, 11, 26, 6, 45, 49, 500000),
         'item_scraped_count': 10,
         'request_depth_max': 1,
         'scheduler/memory_enqueued': 12,
         'start_time': datetime.datetime(2011, 11, 26, 6, 45, 48, 10000)}
2011-11-25 22:45:49-0800 [irishtunes] INFO: Spider closed (finished)
2011-11-25 22:45:49-0800 [scrapy] INFO: Dumping global stats:
        {}
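
The stats at the end of the log already hint at how far the crawl got. The sketch below copies a few of those values into a plain dict and derives some numbers from them; the function name `summarize` and the derived key names are made up for illustration and are not part of Scrapy.

```python
# Values copied from the "Dumping spider stats" block above.
stats = {
    "downloader/request_count": 12,
    "downloader/response_status_count/200": 11,
    "downloader/response_status_count/301": 1,
    "item_scraped_count": 10,
    "request_depth_max": 1,
}

def summarize(stats):
    """Derive a few crawl-coverage numbers from the spider stats (illustrative only)."""
    ok_pages = stats["downloader/response_status_count/200"]
    items = stats["item_scraped_count"]
    return {
        # Pages fetched successfully that did not produce an item.
        "non_item_pages": ok_pages - items,
        "redirects": stats["downloader/response_status_count/301"],
        "max_depth": stats["request_depth_max"],
    }

print(summarize(stats))
```

Read this way, the 12 requests break down into one redirect, one page that yielded no item (presumably the start page), and the 10 tune pages, and the crawl never went deeper than one link from the start page.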

EDIT: The answer from @reclosedev got me going. For those who are curious about the outcome, here's a snapshot ...

(1) The vast majority of tunes are in fewer than 10 members' tunebooks


(2) The popularity of all 10,379 tunes I was able to scrape from the site (measured as the number of tunebooks they are in) follows a power-law distribution


(3) And here are the tunes that are in more than 1,000 tunebooks on the site, with the names of the most popular tunes and the number of tunebooks they are in

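
For anyone wanting to reproduce a ranking like this, here is a minimal sketch of how the exported CSV could be sorted by tunebook count. The sample rows are taken from the items in the log above; the column order and the assumption that each field is exported as a single plain value are guesses about the feed format, and `top_tunes` is a made-up helper name. It is written for Python 3.

```python
import csv
import io

# Sample rows built from items in the log above.
# Assumption: the feed has a header row with these column names.
SAMPLE_CSV = """count,key,name1,name2,url
1,Key signature: Dmajor,Brendan Begley's,polka,http://www.thesession.org/tunes/display/11602
3,Key signature: Amajor,Carleton County Breakdown,reel,http://www.thesession.org/tunes/display/11593
5,Key signature: Gmajor,The Full Of The Bag,hornpipe,http://www.thesession.org/tunes/display/11594
4,Key signature: Gmajor,Galen's Arrival,reel,http://www.thesession.org/tunes/display/11598
"""

def top_tunes(csv_file, n=100):
    """Return the n tunes with the highest tunebook count as (name, count) pairs."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: int(row["count"]), reverse=True)
    return [(row["name1"], int(row["count"])) for row in rows[:n]]

for name, count in top_tunes(io.StringIO(SAMPLE_CSV), n=3):
    print(count, name)
```

To run it on the real export, `io.StringIO(SAMPLE_CSV)` could be swapped for `open("scraped_data.csv", newline="")`, provided the file has the same column names.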

asked by Ben on 27 November 2011 at 07:00