I am parsing a text document line by line and ran into some odd behavior, which I believe is caused by a symbol that looks like an ankh (☥). I am not able to copy the real symbol here. In my code I check whether a '+' is present in the first two characters of each line. To verify that this works, I added a print statement that shows the resulting boolean and the line itself.

The relevant part of my code:

with open(file_path) as input_file:
    content = input_file.readlines()
    for line in content:
        plus = '+' in line[0:2]
        print('Plus: {0}, line: {1}'.format(plus,line))

An example of a file I am trying to parse:

+------------------------------
row 1 with some content
+------+------+-------+-------
☥+------+------+-------+------
|  col 1 | col 2 | col 3 ...
+------+------+-------+-------
|_ valu | val |    |   dsf |..
|_ valu | valu | ...

What I get as output:

Plus: True, line: +------------------------------

Plus: False, line: row 1 with some content

Plus: True, line: +------+------+-------+-------

♀+------+------+-------+------

Plus: False, line: | col 1 | col 2 | col 3 ...

Plus: True, line: +------+------+-------+-------

Plus: False, line: |_ valu | val | | dsf |..

Plus: False, line: |_ valu | valu | ...

So my question is: why does it print the line containing the symbol without the 'Plus: True/False' prefix, and how should I solve this? Thanks.

  • I just tried to reproduce this with the same sequence of input lines and didn't get any repeated lines. Commented Mar 3, 2017 at 12:36
  • Maybe your lines have a \r character in them. Try printing the repr version of them. Commented Mar 3, 2017 at 12:40
  • Mm I did have to insert a unicode symbol in here because I can't seem to copy the real symbol. Commented Mar 3, 2017 at 12:40
  • @spijs here you have it, \r resets caret to line beginning. Commented Mar 3, 2017 at 12:50
  • You may want to process it or not, but in ASCII, '\x0c' is the code for form feed. It means that the program that has created it intended to start a new page there. Commented Mar 3, 2017 at 12:59

1 Answer

What you are seeing is the female sign (♀), one of the gender symbols. In the original IBM PC character set it is the glyph for byte 0x0c, which in ASCII is the form feed control character, also written Ctrl-L.

If the text data you are parsing contains these characters, they were likely inserted to tell a printer to start a new page.

From Wikipedia:

Form feed is a page-breaking ASCII control character. It forces the printer to eject the current page and to continue printing at the top of another. Often, it will also cause a carriage return. The form feed character code is defined as 12 (0xC in hexadecimal), and may be represented as control+L or ^L.
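
As a rough sketch of how this could be handled in the parser, the snippet below prints each line with repr() so that control characters such as '\x0c' and '\r' become visible, and drops a leading form feed before testing for '+'. The helper name starts_with_plus and the in-memory sample are hypothetical stand-ins, not part of the original code, and the test is slightly stricter than the original '+' in line[0:2] check.

```python
import io

FORM_FEED = "\x0c"  # ASCII 12, also written Ctrl-L or ^L


def starts_with_plus(line):
    """Return True if '+' is the first character after any leading form feeds.

    Hypothetical helper: it also drops the trailing line ending before testing.
    """
    cleaned = line.lstrip(FORM_FEED).rstrip("\r\n")
    return cleaned.startswith("+")


# Hypothetical sample standing in for the real file; the third line
# begins with a form feed, as the answer suggests the real file does.
sample = io.StringIO(
    "+------\n"
    "row 1 with some content\n"
    "\x0c+------+------\n"
    "|  col 1 | col 2\n"
)

results = []
for line in sample:
    # repr() shows control characters as escapes instead of sending
    # them to the terminal, where they can disturb the output.
    print(repr(line))
    results.append(starts_with_plus(line))

print(results)
```

If the form feeds carry meaning, for example as page boundaries, it may be preferable to record where they occur rather than discard them.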
