Recursive crawling with Python and Scrapy

I'm using Scrapy to crawl a site. The site has 15 listings per page and then a "next" button. I am running into an issue where my Request for the next link is being called before I've finished parsing all of my listings in the pipeline. Here is the code for my spider:

import urlparse

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

# MySiteLoader is my item loader (a simplified sketch is shown below)


class MySpider(CrawlSpider):
    name = 'mysite.com'
    allowed_domains = ['mysite.com']
    start_url = 'http://www.mysite.com/'

    def start_requests(self):
        return [Request(self.start_url, callback=self.parse_listings)]

    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        # One detail-page request per listing; the half-filled item rides
        # along in request.meta and is finished in parse_listing_details
        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')

            item = il.load_item()
            listing_url = listing.select('...').extract()

            if listing_url:
                yield Request(urlparse.urljoin(response.url, listing_url[0]),
                              meta={'item': item},
                              callback=self.parse_listing_details)

        # Follow pagination after queuing the listing detail requests
        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()
        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        # Pick up the item started in parse_listings and finish it here
        item = response.request.meta['item']
        details = hxs.select('...')
        il = MySiteLoader(selector=details, item=item)

        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        return il.load_item()
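
For reference, `MySiteLoader` is just a thin item loader over my listing item. A simplified sketch of it, with field names matching the `add_xpath` calls above:

from scrapy.item import Item, Field
from scrapy.contrib.loader import XPathItemLoader


class MySiteItem(Item):
    # The fields populated by the spider above
    Title = Field()
    Link = Field()
    Posted_on_Date = Field()
    Description = Field()


class MySiteLoader(XPathItemLoader):
    default_item_class = MySiteItem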

These lines are the problem. Like I said before, they are being executed before the spider has finished crawling the current page, and on every page of the site this causes only 3 out of 15 of my listings to be sent to the pipeline.

        if next_page_url:
            yield Request(urlparse.urljoin(response.url, next_page_url[0]),
                          callback=self.parse_listings)
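
To make the ordering I'm after concrete, here is a rough, untested sketch of the sequencing I think I need: hold back the next-page URL and only follow it once every detail request from the current page has been parsed. The pending_details and next_page attributes are names I made up for this sketch, not anything Scrapy provides:

    # Inside MySpider -- sketch only, untested
    def parse_listings(self, response):
        hxs = HtmlXPathSelector(response)
        listings = hxs.select('...')

        # Count outstanding detail requests and park the next-page URL
        self.pending_details = 0
        self.next_page = None

        for listing in listings:
            il = MySiteLoader(selector=listing)
            il.add_xpath('Title', '...')
            il.add_xpath('Link', '...')
            item = il.load_item()
            listing_url = listing.select('...').extract()
            if listing_url:
                self.pending_details += 1
                yield Request(urlparse.urljoin(response.url, listing_url[0]),
                              meta={'item': item},
                              callback=self.parse_listing_details)

        next_page_url = hxs.select('descendant::div[@id="pagination"]/'
                                   'div[@class="next-link"]/a/@href').extract()
        if next_page_url:
            self.next_page = urlparse.urljoin(response.url, next_page_url[0])

    def parse_listing_details(self, response):
        hxs = HtmlXPathSelector(response)
        il = MySiteLoader(selector=hxs.select('...'),
                          item=response.request.meta['item'])
        il.add_xpath('Posted_on_Date', '...')
        il.add_xpath('Description', '...')
        yield il.load_item()

        # Release the next page once the last detail page is done.
        # (If a detail request fails outright, this counter never reaches
        # zero, so a real version would need an errback as well.)
        self.pending_details -= 1
        if self.pending_details == 0 and self.next_page:
            yield Request(self.next_page, callback=self.parse_listings)

This only works because pagination is sequential (one listings page in flight at a time), and it feels fragile, which is part of why I'm asking.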

This is my first spider, so this might be a design flaw on my part. Is there a better way to do this?
