Sublime Forum

Regex for strings beginning with capital letters

#1

I have a huge spreadsheet column filled with data like this:

Abutilon abutiloides (Jacq.) Garcke ex Hochr.
Abutilon lignosum (Cav.) G. Don
Abietinella abietina (Hedw.) Fleisch.
Hypnum L. abietinum Hedw.
Thuidium abietinum (Hedw.) Schimp.
Abronia alpina Brandegee
Abies Rogers 2009 alba Mill.
Abies amabilis (Douglas ex Loudon) Douglas ex Forbes

I want to delete all the extraneous text, leaving me with a list of scientific names, like this:

Abutilon abutiloides
Abutilon lignosum
Abietinella abietina
Hypnum abietinum
Thuidium abietinum
Abronia alpina
Abies alba
Abies amabilis

This regular expression deletes parentheses and their contents:

\([^()]*\)
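For anyone who wants to check this step outside Sublime, here is a minimal Python sketch. It assumes the parentheses in the pattern are escaped (`\(` and `\)`) so they match literal characters rather than opening a capture group; the helper name `strip_parens` and the leading `\s*` (which swallows the space before each group) are illustrative additions, not part of the original regex.

```python
import re

# Escaped parentheses match literal "(" and ")";
# [^()]* matches everything between them that is not a parenthesis.
PAREN = re.compile(r"\s*\([^()]*\)")

def strip_parens(line):
    """Remove each parenthesized group and the whitespace just before it."""
    return PAREN.sub("", line)

print(strip_parens("Abutilon lignosum (Cav.) G. Don"))
# prints: Abutilon lignosum G. Don
```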

The next step is to delete every string beginning with a capital letter preceded by a space. In other words, things like these should be deleted:

[space]L.
[space]Rogers
[space]Williams2009

ChatGPT gave me the following regexes. The first one blew up in my face. The others are a little better, but they still don’t do the job.

\s[A-Z][^\s]*

(\b[A-Z][a-z]+ [a-z]+)(?: [A-Z][^\s]*)+

(\b[A-Z][a-z]+ [a-z]+)(?: [A-Z][^\s])

Can anyone give me a regex that will purge my file of all strings beginning with capital letters and preceded by a space? Thanks!


#2

I think you would want something like this:
( [A-Z][a-z0-9.]+)

I really like using regexr for help with these; here you can see it applied to your examples: https://regexr.com/88rbt


#3

Thanks for the tip, but that didn’t work for me. It deleted just about everything, including strings starting with lowercase letters. It also inserted the character & - lots of them. I wonder if there might be something wrong with my Sublime program. One of the regexes I got from ChatGPT replaced the text with thousands of &'s.


#4

Hmm, I am not sure what would cause &s to be inserted, but with a regex like this you also need to turn on the match case button. Without it the search is case-insensitive, so [A-Z] matches lowercase letters too.
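The effect of the case setting can be reproduced in Python. This is only a sketch: it assumes a case-insensitive search in Sublime behaves like Python's `re.IGNORECASE` flag, and the `purge` helper and its `match_case` parameter are hypothetical names used for illustration.

```python
import re

# The pattern from post #2: a space, one capital letter,
# then one or more lowercase letters, digits, or dots.
PATTERN = r" [A-Z][a-z0-9.]+"

def purge(line, match_case=True):
    """Delete every match of PATTERN; match_case=False mimics a case-insensitive search."""
    flags = 0 if match_case else re.IGNORECASE
    return re.sub(PATTERN, "", line, flags=flags)

line = "Abronia alpina Brandegee"
print(purge(line, match_case=True))   # prints: Abronia alpina
print(purge(line, match_case=False))  # prints: Abronia
```

With case ignored, `[A-Z]` also matches the `a` in `alpina`, which would explain lowercase words being deleted along with the capitalized ones.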


#5

This can be done with Python:
bear.py

import re

names = [
    "Abutilon abutiloides (Jacq.) Garcke ex Hochr.",
    "Abutilon lignosum (Cav.) G. Don",
    "Abietinella abietina (Hedw.) Fleisch.",
    "Hypnum L. abietinum Hedw.",
    "Thuidium abietinum (Hedw.) Schimp.",
    "Abronia alpina Brandegee",
    "Abies Rogers 2009 alba Mill.",
    "Abies amabilis (Douglas ex Loudon) Douglas ex Forbes"
]

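# Genus (capitalized) immediately followed by a lowercase species epithet
pattern = re.compile(r"[A-Z][a-z]+\s[a-z]+")

def binomial(name):
    """Return 'Genus species', skipping any author text between the two words."""
    match = pattern.match(name)
    if match:
        return match.group(0)
    # Fallback: first word plus the first all-lowercase word after it
    words = name.split()
    if not words:
        return ""
    species = next((word for word in words[1:] if word.islower()), None)
    return f"{words[0]} {species}" if species else words[0]

result = [binomial(name) for name in names]

# Print the result
for name in result:
    print(name)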

Result:

Abutilon abutiloides
Abutilon lignosum
Abietinella abietina
Hypnum abietinum
Thuidium abietinum
Abronia alpina
Abies alba
Abies amabilis

** Process exited - Return Code: 0 **

or with JavaScript, demo


#6

Oh my God, I never even thought about that. I thought the regex itself was designed to do that. Anyway, it took me a minute to find the case button, but I tried it, and it looks like it worked perfectly. Thanks so much for the tip.

P.S. I just found a few strings that weren’t deleted, but they appear to be very few and far between, and a couple start with foreign characters. Even with a few misses, this will still save me hours of time.
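The misses on foreign characters are consistent with `[A-Z]` covering only the 26 unaccented capitals. One possible workaround outside Sublime is sketched below in Python, using `str.isupper()`, which recognizes capitals in any script. The function name `drop_capitalized_words` and the sample string with `Émile` are made up for illustration and do not come from the spreadsheet.

```python
def drop_capitalized_words(line):
    """Keep the first word (the genus) and drop later words that start with a capital."""
    words = line.split()
    # str.isupper() is Unicode-aware, so "É" counts as a capital letter here.
    kept = words[:1] + [word for word in words[1:] if not word[0].isupper()]
    return " ".join(kept)

print(drop_capitalized_words("Abies alba Émile"))
# prints: Abies alba
```

Tokens that start with a digit or a parenthesis are left alone by this sketch, so it would still need to be combined with the other cleanup steps.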


#7

Thanks. Those might come in handy, because I’m still encountering a few quirks. I don’t have a clue about Python, but I have a little experience with JavaScript.
