I am looking for names of books and authors in a bunch of texts, like:
my_text = """
My favorites books of all time are:
Harry potter by J. K. Rowling, Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times). That's it by the way.
"""
Right now I am using the following code to split the text on separators like this:
import re

pattern = r" *(.+) by ((?: ?\w+)+)"
matches = re.findall(pattern, my_text)
res = []
for match in matches:
    # match[0] is the title group, match[1] is the author group
    res.append((match[0], match[1]))
print(res) # [('Harry potter', 'J'), ('K. Rowling, Dune (first book)', 'Frank Herbert'), ('and Le Petit Prince', 'Antoine de Saint Exupery '), ("I read it many times). That's it", 'the way')]
There are some false positives (like "That's it by the way"), but my main problem is that authors get cut off when their names are written with initials, which is pretty common.
I can't figure out how to allow initials like "J. K. Rowling" (or the same without a space before / after the dot, like "J.K.Rowling").
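To make the goal concrete, here is a minimal sketch of one possible direction: treat initials as an optional prefix of the author group, so that a dot is only accepted right after a single capital letter. The names `INITIALS`, `NAME_WORDS`, `BOOK_BY_AUTHOR` and `find_books` are made up for this illustration, and the pattern assumes that initials are uppercase ASCII letters and that a leading lowercase "and" is never part of a title.

```python
import re

my_text = """
My favorites books of all time are:
Harry potter by J. K. Rowling, Dune (first book) by Frank Herbert;
and Le Petit Prince by Antoine de Saint Exupery (I read it many times). That's it by the way.
"""

# Hypothetical helper names, used only for this sketch.
INITIALS = r"(?:[A-Z]\. ?)*"    # zero or more initials: "J. K. " or "J.K."
NAME_WORDS = r"\w+(?: \w+)*"    # one or more space-separated words

BOOK_BY_AUTHOR = re.compile(
    r"[,;\s]*(?:and\s+)?"         # skip separators and a leading "and" (heuristic)
    r"(.+?) by "                  # title: shortest text before " by "
    rf"({INITIALS}{NAME_WORDS})"  # author: optional initials, then name words
)

def find_books(text):
    """Return a list of (title, author) tuples found in text."""
    return BOOK_BY_AUTHOR.findall(text)

for title, author in find_books(my_text):
    print(f"{title!r} by {author!r}")
```

With this sketch the first three pairs should come out as 'Harry potter' / 'J. K. Rowling', 'Dune (first book)' / 'Frank Herbert' and 'Le Petit Prince' / 'Antoine de Saint Exupery', and "J.K.Rowling" without spaces is accepted as well. It does nothing about false positives such as "That's it by the way", which is still matched.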
Source: https://stackoverflow.com/questions/75196453/regex-split-on-but-not-in-substrings-like-j-k-rowling