HtmlUnit and XPath: does DOMNode.getByXPath only work on HtmlPage?

I'm trying to parse a page of links to articles, where the relevant content looks like this:

Performing Arts

EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher

Here is a minimal parsing example in Java using HtmlUnit and XPath (imports removed for brevity):

public class MinimalTest {
    public static void main(String[] args) throws Exception {
        WebClient client = new WebClient();
        client.setJavaScriptEnabled(false);
        client.setCssEnabled(false);
        System.out.println("Fetching front page");
        HtmlPage frontPage = client.getPage("http://living.scotsman.com/sectionhome.aspx?sectionID=7063");
        List<ArticleInfo> articleInfos = extractArticleInfo(frontPage);

        for (ArticleInfo info : articleInfos)
        {
            System.out.println("Title: " + info.getTitle());
            System.out.println("Intro: " + info.getFirstPara());
            System.out.println("Link: " + info.getLink());
        }
    }

    @SuppressWarnings("unchecked") // getByXPath returns List<?>
    private static List<ArticleInfo> extractArticleInfo(HtmlPage frontPage) {
        System.out.println("Extracting article links");
        List<HtmlDivision> articleDivs = (List<HtmlDivision>) frontPage.getByXPath("//div[@class='article']");
        System.out.println(String.format("Found %d articles", articleDivs.size()));
        List<ArticleInfo> articleLinks = new ArrayList<ArticleInfo>(articleDivs.size());
        for (HtmlDivision div : articleDivs) {
            articleLinks.add(ArticleInfo.constructFromArticleDiv(div));
        }
        return articleLinks;
    }

    private static class ArticleInfo {
        private final String title;
        private final String link;
        private final String firstPara;

        public ArticleInfo(final String link, final String title, final String firstPara) {
            this.link = link;
            this.title = title;
            this.firstPara = firstPara;
        }
        public static ArticleInfo constructFromArticleDiv(final HtmlDivision div) {
            String link = ((DomText) div.getFirstByXPath("//a/@href/text()")).asText();
            String title = ((DomText) div.getFirstByXPath("//span[@class='mth3']/text()")).asText();
            String firstPara = ((DomText) div.getFirstByXPath("//span[@class='mtp']/text()")).asText();
            return new ArticleInfo(link, title, firstPara);
        }
        public String getTitle() {
            return title;
        }
        public String getFirstPara() {
            return firstPara;
        }
        public String getLink() {
            return link;
        }
    }  
}

Expected output:

Title: EIF theatre review: Sin Sangre | The Man Who Fed Butterflies | Caledonia | Songs Of Ascension | Vieux Carré | The Gospel At Colonus 
Intro: The EIF's theatre programme wasn't as far-reaching as it could have been, but did find an exoticism in the familiar, writes Mark Fisher 
Link: http://living.scotsman.com/performing-arts/EIF-theatre-review-Sin-Sangre.6517348.jp

What I get:

Fetching front page
Extracting article links
Found 24 articles
Exception in thread "main" java.lang.NullPointerException
    at com.allthefestivals.app.crawler.MinimalTest$ArticleInfo.constructFromArticleDiv(MinimalTest.java:68)
    at com.allthefestivals.app.crawler.MinimalTest.extractArticleInfo(MinimalTest.java:50)
    at com.allthefestivals.app.crawler.MinimalTest.main(MinimalTest.java:30)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:115)

Calling getByXPath works fine on the HtmlPage, but it doesn't seem to return anything on any other HtmlElement. What's wrong? Is this a bug or an implementation gap in HtmlUnit, or am I missing something subtle in the XPath syntax?
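One way to narrow this down without HtmlUnit is to check how a standard XPath 1.0 engine treats a leading `//` when the expression is evaluated against an element instead of the document. The sketch below uses the JDK's `javax.xml.xpath` rather than HtmlUnit, and the markup, the class name `RelativeXPathDemo`, and its helper methods are all made up for illustration. In XPath 1.0 a path starting with `/` or `//` is resolved from the root of the document that contains the context node, while `.//` stays inside the context node. The sketch also counts matches for `@href/text()`, since attribute nodes have no child nodes in the XPath 1.0 data model.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; it uses the JDK's XPath engine, not HtmlUnit.
class RelativeXPathDemo {
    // Made-up markup standing in for the real page: two article divs.
    static final String XML =
          "<body>"
        + "<div class='article'><a href='/first'><span class='mth3'>First</span></a></div>"
        + "<div class='article'><a href='/second'><span class='mth3'>Second</span></a></div>"
        + "</body>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the second article div, used below as the XPath context node.
    static Node secondArticle(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate("(//div[@class='article'])[2]", doc, XPathConstants.NODE);
    }

    // String value of the first node (in document order) matched by expr.
    static String eval(String expr, Node context) throws Exception {
        return XPathFactory.newInstance().newXPath().evaluate(expr, context);
    }

    // Number of nodes matched by expr when evaluated against context.
    static int count(String expr, Node context) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expr, context, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Node div = secondArticle(parse(XML));
        // Leading "//" starts at the document root, even with div as context.
        System.out.println("absolute: " + eval("//span[@class='mth3']", div));
        // Leading ".//" searches only the descendants of div.
        System.out.println("relative: " + eval(".//span[@class='mth3']", div));
        // Selecting the attribute node itself yields its value.
        System.out.println("href:     " + eval(".//a/@href", div));
        // Count of text() children under the href attribute.
        System.out.println("@href/text() matches: " + count("//a/@href/text()", div));
    }
}
```

HtmlUnit also evaluates XPath 1.0, so the same rules would presumably apply to `div.getFirstByXPath(...)` in `constructFromArticleDiv`, though I have not verified that against the live page.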

Related question whose solution didn't work for me: XPath _relative_ to given element in HTMLUnit/Groovy?

asked by Community on 23 May 2017 at 12:01