Sublime Forum

Regex: Match and Remove particular repeated words from tags

#1

Hello. I have this text with many &nbsp; entities inside the tag <p class="my_class">:

 <p class="my_class">An Extension&nbsp;of Java for Event Correlation. 571 geographical/logical coordinates, or sources. Henceforth,&nbsp;we will use the term&nbsp;events to refer to&nbsp;both the incidents underlying such&nbsp;events as well as to their incarnations&nbsp;and notifications</p>

I want to select this tag and replace every &nbsp; with a regular space.

First of all, I select the tag and the content: (?s)<p class="my_class">([^<]*)</p>

Then I try to include &nbsp; in this regex so that only &nbsp; is selected:

(?s)<p class="my_class">.*?&nbsp;([^<]*)</p> but it doesn’t work. Can anyone help me?


#2

I’ll give my usual advice: keep it simple and performant.

Keep your first working search of (?s)<p class="my_class">([^<]*)</p>, click Find All, then do a new search, “in selection” for &nbsp;

Or, for better selection of the p tag contents (this regex will not work if the paragraph contains child elements), don’t use regex to parse HTML; try the right tool for the job instead, like:
https://packagecontrol.io/packages/xpath
to select the tags, then do the replace in selection.


#3

Yes, but that’s the problem :) I need to do this search and replace in more than 100 HTML pages.


#4

ah I see, why not just

(<p class="my_class">[^&<]*+)&nbsp;

replace with \1

although you will need to execute it as many times as &nbsp; appears in the tag’s inner HTML

Unfortunately there’s not really a better way, though maybe you can reduce the number of times it needs to be executed by adding something like (?:([^&<]*)&nbsp;)? to the end and replacing with \1\2
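
If you would rather script the "run it until nothing matches" idea than press Replace All repeatedly, here is a minimal Python sketch. It is only an illustration under a few assumptions: the names PATTERN and replace_until_stable are made up for this example, the possessive `*+` is swapped for a plain `*` because Python's `re` only accepts possessive quantifiers from 3.11 onward, and the replacement appends a space after \1 since the question asks for a space.

```python
import re

# Same idea as the search above, with a plain `*` instead of the possessive `*+`
# (Python's `re` only accepts possessive quantifiers from 3.11 onward).
PATTERN = re.compile(r'(<p class="my_class">[^&<]*)&nbsp;')


def replace_until_stable(text):
    """Re-run the substitution until a pass makes no replacement."""
    passes = 0
    while True:
        # Assumption: group 1 plus a space, since the question asks for a space.
        text, count = PATTERN.subn(r"\g<1> ", text)
        if count == 0:
            return text, passes
        passes += 1


sample = '<p class="my_class">a&nbsp;b&nbsp;c</p><p>x&nbsp;y</p>'
result, passes = replace_until_stable(sample)
print(result)
print(passes)
```

Each pass only removes the first remaining &nbsp; in a matching paragraph, so the number of productive passes equals the largest &nbsp; count in any one paragraph. Note that [^&<]* also stops at any other entity such as &amp;, so an &nbsp; that follows one is left alone.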


#5

Your regex, (<p class="my_class">[^&<]*+)&nbsp; replaced with \1, will find and replace only the first instance of &nbsp;

But I need to select and replace every &nbsp; inside the tag.


#6

which is why I recommended to execute the replacement multiple times, until there are no matches - or to duplicate parts of the expression so you can replace multiple capture groups at once…

otherwise, what you want is impossible without using a regex engine like .NET’s that stores all captured text that matched for a capture group:
https://www.regular-expressions.info/captureall.html

you could probably get clever using \G though, and skip the start of the file:

(?:(?!\A)\G|(<p class="my_class">))([^&<]*+)&nbsp;

replace with \1\2 followed by a space


#7

I can’t manage it :(


#8

This is something that is really best done with a mix of regular expression and coded logic. A pure regex solution is pretty much impossible.

So for a quick example, I’ll use an application I wrote called Rummage to illustrate the logic. First, we would use this pattern:

(<p class="my_class">)([^<]+)

And use a little Python code:

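from rummage.lib import rumcore


class NbspReplace(rumcore.ReplacePlugin):
    def replace(self, m):
        # m is the match for the pattern above: group(1) is the opening tag,
        # group(2) is the text up to the next "<".
        return m.group(1) + m.group(2).replace('&nbsp;', ' ')


def get_replace():
    # Entry point that hands the plugin class to Rummage.
    return NbspReplace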

And you can see the results: every &nbsp; in the matched paragraph text becomes a regular space.

Essentially you can put similar logic in a script and make the changes. Or use something like the plugin RegReplace and apply it to your files. Often if I’m making changes across multiple files, I’ll use Rummage as it will crawl folders and such, and I don’t have to rewrite all that logic. But I’m sure there are other things out there that you can use that can do the same thing.
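
To make the "similar logic in a script" idea concrete, here is a rough standalone sketch that needs only the Python standard library. It reuses the pattern and the replace logic from above; the function names (fix_paragraph, fix_html, fix_folder), the *.html glob and the UTF-8 encoding are my assumptions for illustration, not part of Rummage. It rewrites files in place, so try it on a copy of your folder first.

```python
import re
from pathlib import Path

# Capture the opening tag and the text up to the next "<", as in the pattern above.
PARAGRAPH = re.compile(r'(<p class="my_class">)([^<]+)')


def fix_paragraph(match):
    # Same logic as the plugin's replace(): keep the tag, swap entities in the text.
    return match.group(1) + match.group(2).replace("&nbsp;", " ")


def fix_html(text):
    """Return text with &nbsp; replaced inside matching paragraphs only."""
    return PARAGRAPH.sub(fix_paragraph, text)


def fix_folder(folder, encoding="utf-8"):
    """Rewrite every *.html file under folder in place; return the changed paths."""
    changed = []
    for path in sorted(Path(folder).rglob("*.html")):
        original = path.read_text(encoding=encoding)
        updated = fix_html(original)
        if updated != original:
            path.write_text(updated, encoding=encoding)
            changed.append(path)
    return changed


if __name__ == "__main__":
    sample = '<p class="my_class">An Extension&nbsp;of Java</p><p>keep&nbsp;this</p>'
    print(fix_html(sample))
```

Calling fix_folder("path/to/site") would then process the whole tree in one go. Like the pattern it borrows, this only touches the text up to the first child element of each paragraph.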


#9

SEARCH: (?-si)<p class="(my_class)".+?</p>(?!#)|(&nbsp;)(?=.+#)|#

REPLACE: (?1$0#)(?2\x20)

IMPORTANT: you’ll have to click the Replace All button TWICE.


#10

and another answer:

SEARCH: (?:\G(?!^)|<p\s+class="my_class">)(?:(?!</p>).)*?\K&nbsp;
REPLACE: a single space

or

SEARCH: (?-s)(\G(?!^)|<p\s+class="text_obisnuit">)((?!</p>).)*?\K&nbsp;
REPLACE: a single space
