
I have a .txt file with cyrillic text where a lot of lines end with a short hyphen (-). I want these removed, but without removing the hyphens anywhere else in the file.

This is what I have so far. My idea is to copy the text line by line from file f1 into f2, without the hyphen at the end of each line.

f2 = open('n_dim.txt','w')
with open('dim.txt','r',encoding='utf-8') as f1:
    for line in f1:
        f2.write(line.removesuffix('-'))

I am not getting any errors. The file content is copied, but the hyphens persist. How can I properly remove them?

2 Comments

Side note: you should use with open() on both files: with open('dim.txt', 'r', encoding='utf-8') as f1, open('n_dim.txt', 'w') as f2:. Commented Mar 2, 2023 at 22:28
You should normally use the same utf-8 encoding on both files as well. The default encoding is OS-dependent. Commented Mar 2, 2023 at 23:20

1 Answer

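The reason this is not working as intended is that each line you get while iterating over a file object includes the line ending. In text mode, Python by default translates \r\n and \r to \n on read, so every line (except possibly the last) ends with '\n' rather than '-', and removesuffix('-') has nothing to remove. We can see this by printing the repr of each line while iterating over the file.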

I will use the following example file content for the rest of the answer:

Hello-there-
Привет--
Hello-

If we print the repr of each line, we can see:

with open('dim.txt', 'r', encoding='utf-8') as f_in:
    for line in f_in:
        print(repr(line))

->

'Hello-there-\n'
'Привет--\n'
'Hello-\n'
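The same effect can be reproduced without a file. A minimal sketch, using io.StringIO as a stand-in for the opened file (an assumption for illustration; the sample text mirrors the example above):

```python
import io

# io.StringIO stands in for the opened text file (illustration only).
sample = io.StringIO("Hello-there-\nПривет--\nHello-\n")

unchanged = []
for line in sample:
    # Each line ends with '\n', not '-', so removesuffix('-') removes nothing.
    unchanged.append(line.removesuffix('-') == line)

print(unchanged)  # [True, True, True]
```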

To fix this, we can strip the trailing whitespace (including the newline) from each line before calling removesuffix, and write the newline back afterwards:

with open('dim.txt', 'r', encoding='utf-8') as f_in:
    with open('n_dim.txt', 'w', encoding='utf-8') as f_out:
        for line in f_in:
            f_out.write(line.rstrip().removesuffix('-') + '\n')

This results in the following:

Hello-there
Привет-
Hello
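Note that rstrip() with no arguments also removes trailing spaces and tabs, and + '\n' adds a newline even when the last line of the input had none. If either matters, one option is to split off only the line ending and reattach it afterwards. A minimal sketch; strip_one_hyphen is a hypothetical helper name, not part of the code above:

```python
def strip_one_hyphen(line: str) -> str:
    """Remove a single trailing '-' while keeping the line's own ending."""
    body = line.rstrip('\n')     # drop only the newline, keep other whitespace
    ending = line[len(body):]    # '\n', or '' for a final line without one
    # str.removesuffix requires Python 3.9 or later.
    return body.removesuffix('-') + ending


print(repr(strip_one_hyphen('Привет--\n')))
print(repr(strip_one_hyphen('Hello-')))
```

In the file loop, this would be used as f_out.write(strip_one_hyphen(line)).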

Note that if a line may have more than one trailing dash and you want to remove all of them, you need to use rstrip('-') instead of removesuffix('-'):

with open('dim.txt', 'r', encoding='utf-8') as f_in:
    with open('n_dim.txt', 'w', encoding='utf-8') as f_out:
        for line in f_in:
            f_out.write(line.rstrip().rstrip('-') + '\n')

This results in the following:

Hello-there
Привет
Hello
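The difference between the two methods is easiest to see side by side. Also, str.removesuffix only exists in Python 3.9 and later; on older versions, an endswith check combined with a slice should behave the same way. A short sketch:

```python
line = 'Привет--'

one_removed = line.removesuffix('-')  # removes at most one trailing '-'
all_removed = line.rstrip('-')        # removes every trailing '-'

# Equivalent of removesuffix('-') for Python versions before 3.9.
fallback = line[:-1] if line.endswith('-') else line

print(one_removed, all_removed, fallback)
```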

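If the output must use Windows line endings (\r\n) regardless of the platform the script runs on, pass newline='\r\n' when opening the output file and keep writing + '\n'. Appending '\r\n' by hand is unreliable: in text mode on Windows, '\n' is already translated to '\r\n' on write, so the file would end up with '\r\r\n'.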
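The newline argument of open() controls how each '\n' is translated on write, independent of the platform. A minimal sketch, writing to a temporary file and reading the raw bytes back (the file name out.txt is an assumption for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'out.txt')

    # newline='\r\n' makes every '\n' written come out as '\r\n' on any OS.
    with open(path, 'w', encoding='utf-8', newline='\r\n') as f_out:
        f_out.write('Hello' + '\n')

    # Read back in binary mode to see the bytes actually stored on disk.
    with open(path, 'rb') as f_check:
        raw = f_check.read()

print(raw)  # b'Hello\r\n'
```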

If the input file is small enough, another approach would be to read the whole file and use splitlines once instead of rstrip on each line. Using splitlines would preserve any other trailing whitespace, while rstrip will remove it. Example:

with open('dim.txt', 'r', encoding='utf-8') as f_in:
    with open('n_dim.txt', 'w', encoding='utf-8') as f_out:
        for line in f_in.read().splitlines():
            f_out.write(line.rstrip('-') + '\n')
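
To illustrate the whitespace difference, here is a small sketch on an in-memory string (the sample text is made up for illustration). Keep in mind that str.splitlines also splits on a few other boundary characters besides '\n', such as form feed and the Unicode line separator.

```python
text = 'Hello-\nno hyphen  \nПривет--\n'

# splitlines() removes only the line boundary, so the two trailing spaces survive.
kept = [line.rstrip('-') for line in text.splitlines()]

# rstrip() before rstrip('-') also removes the trailing spaces.
stripped = [line.rstrip().rstrip('-') for line in text.splitlines(keepends=True)]

print(kept)
print(stripped)
```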

4 Comments

Re: '\r\n' and Windows. The files are opened in text mode, where '\n' is translated to '\r\n' on write, so the second-to-last paragraph is incorrect. Just write '\n' in all cases.
If it's written from Linux for Windows consumption, then it would be necessary.
True, but if written from Windows \r\n will be translated to \r\r\n. Where does the OP mention Linux?
I was just trying to provide that for completeness... I would prefer to pretend \r\n doesn't exist...
