Parsing HTML to XML and querying with XPath

I want to parse an HTML page to extract some data. First, I convert it to an XML document using SgmlReader. Then I load the result into an XmlDocument and navigate it with XPath:

// Contains the HTML document as a string
var loadedFile = LoadWebPage();

...

Sgml.SgmlReader sgmlReader = new Sgml.SgmlReader();
sgmlReader.DocType = "HTML";
sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;

sgmlReader.InputStream = new StringReader(loadedFile);

XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true;
doc.XmlResolver = null;
doc.Load(sgmlReader);

This code works fine in most cases, except on one site: www.arrow.com (try searching for something like OP295GS). I can get the results table using the following XPath:

var node = doc.SelectSingleNode(".//*[@id='results-table']");

This gives me a node with several child nodes:

[0]         {Element, Name="thead"}  
[1]         {Element, Name="tbody"}  
[2]         {Element, Name="tbody"}  
FirstChild   {Element, Name="thead"}

OK, now I try to select some of those child nodes using XPath, but this doesn't work:

var childNodes = node.SelectNodes("tbody");
// childNodes.Count == 0

Neither does this:

var childNode = node.SelectSingleNode("thead");
// childNode == null

Nor even this:

var childNode = doc.SelectSingleNode(".//*[@id='results-table']/thead");
// childNode == null

What could be wrong with my XPath queries?


I've just tried parsing that HTML page with Html Agility Pack, and my XPath queries work fine there. But my application uses XmlDocument internally, so Html Agility Pack doesn't suit me.


I even tried the following trick with Html Agility Pack, but the XPath queries don't work that way either:

// Parse and convert the HTML document using Html Agility Pack,
// then load the result into an XmlDocument
HtmlDocument xmlDocument = new HtmlDocument();
xmlDocument.OptionOutputAsXml = true;
xmlDocument.Load(new StringReader(webPage));

XmlDocument document = new XmlDocument();
document.LoadXml(xmlDocument.DocumentNode.InnerHtml);

Perhaps the web page contains errors (not all tags are closed, and so on). Even so, I can see the child nodes (through Quick Watch in Visual Studio), yet I cannot access them through XPath.


My XPath queries work correctly in Firefox with the FirePath and XPather plugins, but they don't work in .NET's XmlDocument :(
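
One thing that may be worth checking (an assumption, not something confirmed for this page): if the converted document declares a default namespace such as xmlns="http://www.w3.org/1999/xhtml", then unprefixed names like thead in XPath 1.0 match only elements in no namespace, while *[@id='...'] still matches because the wildcard ignores the element name. The sketch below reproduces that behaviour on a small hand-written document. SampleXml, LoadSample, and the prefix "x" are hypothetical names introduced for illustration only.

```csharp
using System;
using System.Xml;

public static class NamespaceXPathDemo
{
    // Hypothetical stand-in for the converted page: the root element
    // declares a default namespace, which all descendants inherit.
    public const string SampleXml =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
        "<table id=\"results-table\"><thead/><tbody/><tbody/></table>" +
        "</body></html>";

    public static XmlDocument LoadSample()
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);
        return doc;
    }

    public static void Main()
    {
        XmlDocument doc = LoadSample();

        // The wildcard query succeeds regardless of namespace.
        XmlNode table = doc.SelectSingleNode(".//*[@id='results-table']");

        // A non-empty NamespaceURI here would explain the failing queries.
        Console.WriteLine("Namespace: " + table.NamespaceURI);

        // Unprefixed name: looks for thead in no namespace, finds nothing.
        Console.WriteLine("Plain 'thead' is null: " +
                          (table.SelectSingleNode("thead") == null));

        // Option 1: bind a prefix to the namespace and use it in the query.
        var ns = new XmlNamespaceManager(doc.NameTable);
        ns.AddNamespace("x", table.NamespaceURI);
        Console.WriteLine("x:tbody count: " +
                          table.SelectNodes("x:tbody", ns).Count);

        // Option 2: match on the local name only, ignoring the namespace.
        Console.WriteLine("local-name() count: " +
                          table.SelectNodes("*[local-name()='tbody']").Count);
    }
}
```

If node.NamespaceURI turns out to be empty for the real page, this is not the cause and the problem lies elsewhere.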

asked by Cœur, 29 April 2017 at 16:31