It seems like a fairly simple task, yet I can't find a fast and reliable solution to it.

I have strings in bash, and I want to know the number of character cells they will occupy when printed on the terminal. I need this to align the strings neatly in three columns of n characters each. For that, I need to add as many spaces as necessary to make sure the second and third columns always start at the same position in the terminal.

Example of problematic string length:

v='féé'

echo "${#v}"
 > # 5 (should be 3)

printf '%s' "${v}" | wc -m
 > # 5 (should be 3)

printf '%s' "${v}" | awk '{print length}'
 > # 5 (should be 3)

The best I have found is this, which works most of the time:

echo "${v}" | python3 -c 'v=input();print(len(v))'
 > # 3 (yeah!)

But sometimes I have characters that are modified by a sequence that follows them. I can't copy/paste that here, but this is what it looks like:

v="de\314\201tresse"
echo "${v}"
 > # détresse
echo "${v}" | python3 -c 'v=input();print(len(v))'
 > # 9 (should be 8)

I know it can be even more complicated with the \r character or ANSI sequences, but I only have to deal with "regular" strings of the kind commonly found in filenames, documents and other file content written by humans. Since the string IS printed in the terminal, I guess there must be some engine that knows, or can know, the printed length of the string.

I have also considered sending an ANSI sequence to get the position of the cursor in the terminal before and after printing the string, and using the difference to compute the length, but it looks like a rabbit hole I don't want to go down. It would also be very slow.

  • I don't know if this is contributing to your problems or not, but be aware that echo adds a newline, so the output of echo 'foo' is 4 characters long, not 3 as you might expect. You could use printf '%s' 'foo' instead, but then the output is no longer a valid text "file" since it doesn't have a terminating newline, so YMMV with what any text processing tool does with it. Read the man page for whatever tool you use to determine the length if you go that route, rather than just subtracting 1. Commented Sep 5, 2023 at 20:35
  • @EdMorton But input() in Python removes the newline. Commented Sep 5, 2023 at 20:45
  • Be careful what you ask. "Length of a string" does not mean "apparent number of occupied columns onscreen". The "length of a string" may mean the "number of characters", but sometimes means the number of bytes, which is a very different thing. féé looks like three characters, but those é's are not e's. It takes two bytes to print that one character, though if you print out the individual bytes, one looks like an e, and the other doesn't generally print, since it's an "overstrike". Be super careful in your terminology - and I bet I botched mine here somewhere, lol. Tricky! Commented Sep 6, 2023 at 4:41
  • length is (5, 3, 7, 10, 12, 3), i.e. (number of characters, number of graphemes, number of bytes (UTF-8), number of bytes (utf-16-le/be), number of bytes (utf-16), number of terminal cells). Commented Sep 11, 2023 at 7:28
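To make the distinction in the last comment concrete, here is a small sketch using only Python's standard library. It assumes the string is the decomposed (NFD) spelling of féé, which is the form that reproduces the quoted numbers; the function name `lengths` is made up for illustration. Grapheme and terminal-cell counts are left out because the standard library has no direct function for them.

```python
import unicodedata

def lengths(s):
    """Return several different notions of the 'length' of s."""
    return {
        "code_points": len(s),                          # what Python's len() counts
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16le_bytes": len(s.encode("utf-16-le")),    # no byte order mark
        "utf16_bytes": len(s.encode("utf-16")),         # includes a 2-byte BOM
    }

# Assumption: 'f' followed by two decomposed 'e' + U+0301 pairs.
decomposed = unicodedata.normalize("NFD", "f\u00e9\u00e9")
print(lengths(decomposed))
# {'code_points': 5, 'utf8_bytes': 7, 'utf16le_bytes': 10, 'utf16_bytes': 12}
```

None of these four numbers is the 3 cells the terminal actually uses, which is why the question needs a width-aware approach rather than a length function.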

4 Answers

2

How about

v='féé'
echo "${v}" | python3 -c 'import unicodedata as ud;v=input();print(len(ud.normalize("NFC",v)))'

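Note that unicodedata is part of Python's standard library, so there is nothing to install with pip. If you need a newer Unicode database than the one bundled with your Python, try unicodedata2.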

Additional Notes

This normalizes the string to Unicode Normalization Form C (NFC), which replaces a base character followed by combining marks with a single precomposed character wherever one exists. Note the limitation: when no precomposed character exists, the combining mark remains and is still counted. If you are working with Latin text, it should work fine. For text in languages such as Arabic, Greek, Hebrew, Russian or Thai that originates from pre-Unicode encodings, NFC may keep the original form; you could try NFKC in those cases, although NFC is generally more advisable. The reason for preferring NFC is that it avoids normalizing symbols that are compatible but not canonically equivalent. For example, take the single ligature character ﬀ (U+FB00): normalized with NFC it has length 1, but normalized with NFKC it has length 2. Depending on your application that can create some issues, but if you just want readable text, then NFKC is fine.
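A short sketch of the difference, using only the standard library. The helper names `nfc_len` and `nfkc_len` and the three sample strings are chosen here for illustration; the last one shows the case where NFC cannot help because no precomposed character exists.

```python
import unicodedata as ud

def nfc_len(s):
    """Number of code points after canonical composition (NFC)."""
    return len(ud.normalize("NFC", s))

def nfkc_len(s):
    """Number of code points after compatibility composition (NFKC)."""
    return len(ud.normalize("NFKC", s))

detresse = "de\u0301tresse"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
ligature = "\ufb00"           # U+FB00 LATIN SMALL LIGATURE FF
no_precomposed = "x\u0301"    # there is no single code point for x + acute

print(len(detresse), nfc_len(detresse))        # 9 8
print(nfc_len(ligature), nfkc_len(ligature))   # 1 2
print(nfc_len(no_precomposed))                 # 2, although it occupies one cell
```

So counting code points after NFC gives the right answer for the strings in the question, but it is a length, not a display width.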


6 Comments

You should note the limitations of this approach.
I will update it with some clarifications
NFKC seems to work just fine with Latin ANSI, is there a reason to prefer NFC?
I've added some explanation about that. It's about Unicode equivalence.
@LucasMouraGomes, you do not install unicodedata via pip, it is part of your Python install.
1

To get the number of terminal cells used by a string, it is possible to use wcswidth. There is a Python implementation for wcwidth and wcswidth.

Install wcwidth with pip:

pip install wcwidth

And the Python code would be:

from wcwidth import wcswidth
v = 'féé'
print(wcswidth(v))
# 3

It will also yield the correct result for NFD:

import unicodedata as ud
v = ud.normalize("NFD", v)
print(wcswidth(v))
# 3

Additionally it will correctly handle wide characters, i.e. characters that take up 2 terminal cells per character:

v='中文'
print(wcswidth(v))
# 4

And adapting Lucas' solution above, for the terminal:

v='féé'
echo "${v}" | python3 -c 'from wcwidth import wcswidth;v=input();print(wcswidth(v))'
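If installing wcwidth is not an option, the idea can be approximated with the standard library alone. The sketch below is an assumption-laden simplification, not a reimplementation of wcswidth: it treats combining marks and format characters as zero width, East Asian wide and fullwidth characters as two cells, and everything else as one cell. It does not handle control characters, ANSI sequences or emoji sequences. The names `display_width` and `pad_to` are hypothetical.

```python
import unicodedata

def display_width(s):
    """Approximate the number of terminal cells s occupies."""
    width = 0
    for ch in s:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combining marks and format characters take no cell
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two cells
        else:
            width += 1
    return width

def pad_to(s, columns):
    """Right-pad s with spaces so that it fills `columns` cells."""
    return s + " " * max(0, columns - display_width(s))

# Each row should show the '|' at the same terminal column.
for name in ("de\u0301tresse", "f\u00e9\u00e9", "\u4e2d\u6587"):
    print(pad_to(name, 12) + "|")
```

Padding with the computed width, instead of relying on printf's %-12s (which counts bytes or characters, not cells), is what keeps the columns aligned.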

Comments

0

With Perl:

Without modules:

perl -CSAD -E 'say length($ARGV[0])' été
3

With utf8::all module:

perl -Mutf8::all -E 'say length($ARGV[0])' été
3

1 Comment

It does not seem to work well with combining characters, as in de\314\201tresse (output is 9, should be 8).
0

Using grep and wc:

$ v="de\314\201tresse"
$ printf "%s" "$v" | grep -o '[a-z]' | wc -l
8

$ v='féé'
$ printf "%s" "$v" | grep -o '[a-z]' | wc -l
3

1 Comment

Thanks, that's a very simple solution that needs no additional tools. It does require listing all the "valid" characters one way or another, though, e.g. grep -i -o '[a-z0-9 +special characters]'. I am not sure why [a-z] matches é though; is that expected behavior?
