Sublime Forum

Remove duplicated domains in text

#1

Hi all,

I know there is a way to filter out duplicate lines so that only the unique ones remain, but my question goes a little further. Suppose I have a list of URLs like this:

http://www.abc.com/123
http://www.def.com/223
http://www.abc.com/456
http://www.def.com/556
http://www.qwe.com/667

The first and third URLs are on the same domain (as are the second and fourth), but their paths differ. Is there any way to remove all later entries from the same domain and keep only the first one? I expect the result to be:

http://www.abc.com/123
http://www.def.com/223
http://www.qwe.com/667

Thank you.

0 Likes

#2

I suggest writing a small script in a language of your choice (e.g. Python).
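A minimal sketch of what such a script could look like in Python, assuming one URL per line and that "same domain" means the same host part of the URL. The function name `keep_first_per_domain` is made up for illustration. It keeps the original line order, so no sorting is needed.

```python
from urllib.parse import urlsplit


def keep_first_per_domain(lines):
    """Return the lines, keeping only the first URL seen for each host."""
    seen = set()
    kept = []
    for line in lines:
        host = urlsplit(line.strip()).netloc
        if not host:
            # Not a URL with a host (e.g. a blank line): keep it untouched.
            kept.append(line)
            continue
        if host in seen:
            continue  # a URL from this host was already kept
        seen.add(host)
        kept.append(line)
    return kept


urls = [
    "http://www.abc.com/123",
    "http://www.def.com/223",
    "http://www.abc.com/456",
    "http://www.def.com/556",
    "http://www.qwe.com/667",
]

print("\n".join(keep_first_per_domain(urls)))
```

With the list from the first post this prints the abc, def and qwe lines once each, in their original order.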

0 Likes

#3

Thanks for your reply. I was just looking for a native answer that doesn't use any other programming language.

0 Likes

#4

If you are okay with going through each line manually, there is a possibility, but depending on the file size this would quickly become infeasible.

For that, you would select the host url of a line (e.g. http://www.abc.com/), press Alt+F3, deselect the line you initially selected (using alt+leftclick drag) and then delete the remaining selected lines with Ctrl+Shift+d. Repeat this for every line.

1 Like

#5

Why not just sort it (Edit -> Sort Lines) and then perform a regex find and replace:

Find What: ^(http://[^/]+/)(.*$\n)((\1)(?2))+
Replace With: $1$2

3 Likes

#6

That’s a good idea. I assumed the line order must not be changed and thus didn’t consider this option, but if it can be this would be a good way.

The only “problem” is that you need to run the replacement multiple times because of 1) potential overlaps with an odd number of the same hosts and 2) hosts that occur more than twice, which would need to be halved again.

Edit: I didn’t notice the + at the end, so you can disregard this post.

0 Likes

#7

Nope :) Try it :)

http://www.abc.com/123
http://www.abc.com/456
http://www.def.com/223
http://www.def.com/556
http://www.def.com/602
http://www.ghi.com/700
http://www.ghi.com/731
http://www.qwe.com/667
http://www.qwe.com/667
http://www.qwe.com/667

turns into

http://www.abc.com/123
http://www.def.com/223
http://www.ghi.com/700
http://www.qwe.com/667
0 Likes

#8
http://www.abc.com/123
http://www.def.com/223
http://www.ghi.com/700
http://www.qwe.com/667
http://www.qwe.com/667

I used “Replace all” of course, since continuously pressing “Replace” for 1000 lines is nothing to sweat at.

2 Likes

#9

Oh I see, I should make the final newline optional:

^(http://[^/]+/)(.*$\n?)((\1)(?2))+
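For anyone who wants to check the difference outside the editor, here is a rough Python sketch of both patterns. Python's built-in `re` module has no `(?2)` subroutine call, so that part is spelled out as `.*\n` (or `.*\n?`); the names `STRICT`, `LENIENT` and `collapse` are made up for this example. It assumes the lines are already sorted and that the text does not end with a newline, as in the sample from post #7.

```python
import re

# Assumption: lines are sorted, so URLs from the same host are adjacent.
# STRICT mirrors the first pattern (every line must end with a newline).
STRICT = re.compile(r'^(http://[^/]+/)(.*\n)(?:\1.*\n)+', re.MULTILINE)
# LENIENT mirrors the revised pattern (the newline is optional).
LENIENT = re.compile(r'^(http://[^/]+/)(.*\n?)(?:\1.*\n?)+', re.MULTILINE)


def collapse(text, pattern=LENIENT):
    """Replace each run of same-host lines with the first line of the run."""
    return pattern.sub(r'\1\2', text)


# Sample from post #7, joined without a trailing newline.
sample = "\n".join([
    "http://www.abc.com/123",
    "http://www.abc.com/456",
    "http://www.def.com/223",
    "http://www.def.com/556",
    "http://www.def.com/602",
    "http://www.ghi.com/700",
    "http://www.ghi.com/731",
    "http://www.qwe.com/667",
    "http://www.qwe.com/667",
    "http://www.qwe.com/667",
])

print(collapse(sample, STRICT))   # last qwe line survives as a duplicate
print("---")
print(collapse(sample, LENIENT))  # one line per host
```

The strict pattern cannot match the very last line because it has no newline to consume, which is why one extra qwe line is left over; making the newline optional lets the final line join the run.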
1 Like

#10

Thanks, @kingkeith, your solution is what I needed.

0 Likes