Hi all. I was hoping to lean on the generosity of the ST community for help with an issue I am facing.
I am developing an ST plugin to parse the contents of an XML document. I can get it to work using BS4 and its standard HTML parser, but it prints this warning in the console:
XMLParsedAsHTMLWarning: It looks like you’re using an HTML parser to parse an XML document.
Assuming this really is an XML document, what you’re doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package ‘lxml’ installed, and pass the keyword argument
features="xml"
into the BeautifulSoup constructor.
I have my .python_version set to 3.8 and have added the lxml package to my dependencies.json, like so:
{ "*": { "*": ["bs4", "soupsieve", "lxml"] } }
When I attempt to use this new XML parser, I get the following error in the console:
bs4.exceptions.FeatureNotFound: Couldn’t find a tree builder with the features you requested: xml. Do you need to install a parser library?
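As far as I understand it, bs4 only offers its XML tree builder when `lxml` can actually be imported, so one way to narrow this down is to check importability from inside the plugin host. This is only a minimal diagnostic sketch; the helper name `can_import` is hypothetical and not part of bs4 or Sublime Text.

```python
import importlib


def can_import(module_name):
    """Try to import a module; return (True, "") or (False, error message)."""
    try:
        importlib.import_module(module_name)
    except ImportError as exc:
        # Covers both a missing package and a compiled extension that fails to load.
        return False, str(exc)
    return True, ""


# lxml.etree is the compiled part of lxml, so it is checked separately.
for name in ("bs4", "soupsieve", "lxml", "lxml.etree"):
    ok, error = can_import(name)
    print(name, "OK" if ok else "FAILED: " + error)
```

If `lxml.etree` fails here, the `FeatureNotFound` error would be a symptom of the dependency not loading rather than of the string passed to `BeautifulSoup`.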
I have tried multiple methods to invoke this new parser, all of which fail with the same error. Here is what I have tried:
soup = BeautifulSoup(xml_content, 'xml')
soup = BeautifulSoup(xml_content, 'lxml')
soup = BeautifulSoup(xml_content, 'lxml-xml')
soup = BeautifulSoup(xml_content, features="xml")
None of them work. Does anyone have ideas about how to resolve this, or whether it is even worth the effort? My reason for NOT using the HTML parser is that the XML contains <![CDATA[ .... ]]>
sections, and the HTML parser doesn’t handle these well.
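For comparison, here is a fallback sketch that assumes the standard library's `xml.etree.ElementTree` would be an acceptable substitute for bs4, which I have not settled on. It needs no extra dependency and returns the CDATA content as plain text. The sample document and the `item` element name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real document's structure will differ.
xml_content = "<root><item><![CDATA[ <b>raw</b> & text ]]></item></root>"

root = ET.fromstring(xml_content)

# The parser hands back the CDATA section as ordinary character data,
# so markup-like characters inside it are kept verbatim.
item_text = root.find("item").text
print(repr(item_text))
```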
Thanks in advance.