Sublime Forum

Seeking to use BeautifulSoup XML Parser in a ST plugin

#1

Hi all. I was hoping to lean on the generosity of the ST community to assist with an issue I am facing.

I am developing an ST plugin to parse the contents of an XML document. I can get it to work using BS4 and its standard HTML parser, but it gives this notification in the console:

XMLParsedAsHTMLWarning: It looks like you’re using an HTML parser to parse an XML document.

Assuming this really is an XML document, what you’re doing might work, but you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the Python package ‘lxml’ installed, and pass the keyword argument features="xml" into the BeautifulSoup constructor.

I have my .python_version set to 3.8 and have added the lxml package to my dependencies.json, like so:

{ "*": { "*": [ "bs4","soupsieve","lxml"] } }

When I attempt to use this new XML parser, I get the following error in the console:

bs4.exceptions.FeatureNotFound: Couldn’t find a tree builder with the features you requested: xml. Do you need to install a parser library?

I have tried multiple methods to invoke this new parser, all of which fail with the same error. Here is what I have tried:

soup = BeautifulSoup(xml_content, 'xml')
soup = BeautifulSoup(xml_content, 'lxml')
soup = BeautifulSoup(xml_content, 'lxml-xml')
soup = BeautifulSoup(xml_content, features="xml")

None seem to work. Anyone got any ideas about how to resolve this, or whether it is even worth the effort? My reason for NOT using the HTML parser is that the XML contains <![CDATA[ .... ]]> sections, and the HTML parser doesn’t like these much.

Thanks in advance.


#2

Did you run Package Control: Satisfy Libraries or restart ST after updating your dependencies.json, so that Package Control actually installed the requested libraries?


#3

Hi @deathaxe, thanks for your prompt reply. Yes, I had restarted and then received the popup notification that a new library had been installed and to restart for it to take effect (which I did).

I have also run Satisfy Libraries. It gives a console warning for Python 3.3, but I am running 3.8:

The library “lxml” is not available for Python 3.3 on this platform, or this version of Sublime Text

I assume this is from another plugin I have deployed, but I haven’t been able to track down which one. Either way, I think it’s a red herring.

If I look to my Lib/Python38 folder I can see the following folders were installed yesterday:

Lib/Python38/
|--lxml
|--lxml-5.1.0.dist-info

#4

No import errors in the console? What happens if you call import lxml via ST’s console? Maybe try running one of its functions manually via the console to check whether it works. Maybe the chosen universal2 version of lxml doesn’t work properly.
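
A rough sketch of that check, in case it helps. It tries each import in turn and prints the full traceback for any that fail, which is usually more telling than the FeatureNotFound message from bs4. The helper name probe_import is made up for illustration, and this assumes you run it from a throwaway plugin file so the output lands in ST's console:

```python
import importlib
import traceback


def probe_import(module_name):
    """Try to import a module and report the outcome.

    Returns (True, short message) on success, or
    (False, formatted traceback) if the import raised.
    """
    try:
        importlib.import_module(module_name)
    except Exception:
        # Covers ImportError (including dlopen failures in compiled
        # extensions) as well as anything else raised at import time.
        return False, traceback.format_exc()
    return True, "imported OK"


# The top-level package can import fine while its compiled
# submodule does not, so probe both.
for name in ("lxml", "lxml.etree", "bs4"):
    ok, detail = probe_import(name)
    print(name, "->", "OK" if ok else "FAILED")
    if not ok:
        print(detail)
```

If lxml reports OK but lxml.etree fails, the package files are in place and the problem is in loading the compiled extension itself.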

FWIW, your example code snippets run just fine without any errors on my Windows box.
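
If lxml turns out to be unusable on your platform, one possible workaround is Python's built-in xml.etree.ElementTree, which reads CDATA sections without any extra library. This is only a minimal sketch with a made-up sample document, not a drop-in replacement for the BeautifulSoup API, and it assumes your XML is well formed:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; the real document will differ.
xml_content = """<root>
  <item id="1"><![CDATA[ <b>raw</b> & text ]]></item>
  <item id="2">plain text</item>
</root>"""

root = ET.fromstring(xml_content)

# CDATA content is exposed as ordinary element text, with the
# markup characters inside it left unescaped.
items = {item.get("id"): item.text for item in root.iter("item")}

for item_id, text in items.items():
    print(item_id, repr(text))
```

Note that ElementTree does not keep track of which text came from a CDATA section, so this only suits cases where you need the content rather than the original markup.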


#5

Hi @deathaxe. From the console, if I run import lxml I get no result (i.e. there is no error message, which I take to be a good thing). If I run import lxml.etree (for example), I get an error. I am not smart enough to know if this is a good or bad error. Here is the console output:

>>> import lxml
>>> import lxml.etree
Traceback (most recent call last):
  File "__main__", line 1, in <module>
ImportError: dlopen(/Users/gbird/Library/Application Support/Sublime Text/Lib/python38/lxml/etree.cpython-38-darwin.so, 0x0002): symbol not found in flat namespace '_exsltDateXpathCtxtRegister'
