I have a large .txt file containing numerous email addresses, but it also contains many unnecessary "\n" characters. I want to extract only the email addresses and remove any other characters.
To do this, I wrote a small Python script:
import re

filename = "input.txt"
output_filename = "output.txt"
email_regex = r'\s*([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,})\s*'

with open(filename, "r") as f, open(output_filename, "w") as out:
    for line in f:
        emails = re.findall(email_regex, line)
        for email in emails:
            out.write(email + "\n")
The script extracts most email addresses correctly, but it fails on some inputs.
For example, given a line of data that reads "CC\nexample@example.com\n", my code outputs "nexample@example.com". What I want is "example@example.com", without the leading "n".
I then tested a second small Python script on a single email address, and it produced the expected result:
import re
string = "CC\nexample@example.com\n"
email_regex = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}'
email = re.search(email_regex, string).group()
print(email)
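One possible explanation for the difference, assuming the file stores "\n" as two literal characters (a backslash followed by "n") rather than as a real newline: in the test script, Python interprets "\n" inside the string literal as a single newline, so the "n" never reaches the regex. The sketch below models the file content with a raw string (the names `from_file` and `from_literal` are illustrative, not part of the original scripts) and runs the pattern from the first script on both forms.

```python
import re

# Same pattern as the file-based script above.
email_regex = r'\s*([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,})\s*'

# Assumption: the file holds a backslash followed by "n" (two characters).
# A raw string models that content.
from_file = r"CC\nexample@example.com\n"

# In a normal string literal, Python converts "\n" into one newline character.
from_literal = "CC\nexample@example.com\n"

print(re.findall(email_regex, from_file))     # ['nexample@example.com']
print(re.findall(email_regex, from_literal))  # ['example@example.com']
```

If the assumption holds, the backslash is skipped because it is not an allowed character, and the match starts at the "n" that follows it.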
I want to get the same result when processing the large file. How can I do that?
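A minimal sketch of one way to approach this, assuming the file contains the two literal characters backslash and "n" rather than real newlines: replace that sequence with a space before matching, so the "n" can no longer be read as the first letter of an address. The helper names `extract_emails` and `extract_file` are hypothetical. The pattern also drops the "|" from the last character class, since inside square brackets it matches a literal pipe rather than meaning "or".

```python
import re

# Pattern without the stray "|" inside the final character class.
EMAIL_RE = re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}')

def extract_emails(text):
    """Return the addresses in text, treating a literal backslash-n as a separator."""
    # Assumption: the input holds the two characters "\" and "n", not a newline.
    cleaned = text.replace("\\n", " ")
    return EMAIL_RE.findall(cleaned)

def extract_file(in_path, out_path):
    """Read in_path line by line and write one address per line to out_path."""
    with open(in_path, "r") as src, open(out_path, "w") as out:
        for line in src:
            for email in extract_emails(line):
                out.write(email + "\n")

print(extract_emails(r"CC\nexample@example.com\n"))  # ['example@example.com']
```

Calling `extract_file("input.txt", "output.txt")` would then process the large file one line at a time, so the whole file never has to fit in memory. If the file turns out to contain real newlines instead, the replace step does nothing and the pattern still applies.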
source https://stackoverflow.com/questions/76095385/how-can-i-remove-the-n-character-in-a-large-file-with-python