UTF-8 problem

KenDown · Post by **KenDown** » Mon 18 Oct 2021, 15:43

There appears to be a problem with the implementation of UTF-8, which I hope will come out in the program snippet below. The REM'd lines are taken directly from Richard's example program Examples\General\Unicode.bbc. Nevertheless, in the fifth word from the right there is a dotted circle which is definitely not a Hebrew character! The string displays correctly in Notepad and OpenOffice.

Code:

      hebrew$="והארץ היתה תהו ובהו וחשׁך על־פני תהום ורוח אלהים מרחפת על־פני המים׃"
      VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
      VDU 23,16,2;0;0;0;13 : REM Select right-to-left printing
      *FONT Times New Roman,24,B
      PRINT hebrew$
      *FONT Arial,24,B
      PRINT hebrew$
      VDU 23,16,0;0;0;0;13 : REM Select left-to-right printing

Looking in Character Map the offending character appears to be U+05C1; it is possible that U+05C2 will also be displayed incorrectly - I haven't checked.
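
To check U+05C2 as well, something like the following should do. It is only a sketch: the byte values are the UTF-8 encodings of U+05E9, U+05C1 and U+05C2, the mode line copies the `128+8` form used in the later listings in this thread, and `base$` is a made-up name.

```bbcbasic
REM Sketch: print the three-pronged letter followed by each dot in turn
REM U+05E9 = &D7 &A9, U+05C1 = &D7 &81, U+05C2 = &D7 &82 in UTF-8
base$ = CHR$&D7 + CHR$&A9
VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
*FONT Times New Roman,24,B
PRINT base$ + CHR$&D7 + CHR$&81 : REM letter + U+05C1
PRINT base$ + CHR$&D7 + CHR$&82 : REM letter + U+05C2
```

If the second line shows a dotted circle as well, U+05C2 is affected in the same way.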

Any work-arounds for the problem will be gratefully received.

DDRM · Post by **DDRM** » Tue 19 Oct 2021, 07:44

Hi Kendall,

Compared with the defined string (which appears to be rendered correctly in the IDE, though I can't be sure since I don't know the Hebrew alphabet!), the circle of dots appears to be an additional character. Stepping through the string reveals a character there that is not being displayed. Is there some sort of control character in the string?

Best wishes,

D

KenDown · Post by **KenDown** » Thu 21 Oct 2021, 03:17

Yes, the circle of dots is an additional character. The preceding character (i.e. the one to the right) is a shape with three prongs pointing up. If there is a dot over the right-hand prong the letter is 'SH'; if the dot is over the left-hand prong the letter is 'S'. If you look closely, you will see that above the circle of dots there is a single larger dot.

However, the point is that if you copy and paste the Hebrew into something like Word, the dot is positioned correctly and the circle of dots does not appear. Is that because Word does some jiggery-pokery with that character, or is the control code built into the character but not recognised and acted upon by BB4W? And, of course, what do I do about it to get it displaying correctly? Should I scan the string for that character and move it back?

KenDown · Post by **KenDown** » Thu 21 Oct 2021, 03:45

Hmmmm. No, there don't appear to be any control characters. Each Hebrew character is made up of two bytes: 215 followed by another number. Thus ALEPH, the first letter in the Hebrew alphabet, is 215 144. The dotted circle with the right-hand dot is 215 129; the dotted circle with the left-hand dot is 215 130. In other words, they are perfectly "normal" UTF-8 two-byte characters.
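
A minimal sketch of that byte layout, assuming a string made up entirely of two-byte characters (true of the Hebrew letters and the two dots, not of spaces or other ASCII). The test string and the variable names are illustrative.

```bbcbasic
REM Sketch: list each two-byte character as its bytes and its code point
REM Test string: ALEPH (U+05D0), the three-pronged letter (U+05E9), U+05C1
A$ = CHR$&D7 + CHR$&90 + CHR$&D7 + CHR$&A9 + CHR$&D7 + CHR$&81
FOR I% = 1 TO LEN(A$) STEP 2
  B1% = ASC(MID$(A$, I%, 1))
  B2% = ASC(MID$(A$, I%+1, 1))
  REM Two-byte UTF-8: 110xxxxx 10yyyyyy gives the code point xxxxxyyyyyy
  U% = ((B1% AND &1F) << 6) OR (B2% AND &3F)
  PRINT STR$(B1%) + " " + STR$(B2%) + " = U+" + STR$~(U%)
NEXT
```

MID$ and LEN work in bytes, not characters, which is why the loop steps by two.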

The difficulty is that the dotted circle comes *after* the character it modifies, so I would always have to look one character ahead: if it was 215 129 I would have to print that, then print the previous character (SIN), then skip forward two characters to get to the next displayed character. Alternatively, on encountering SIN I would not display it until I had checked the next character and displayed that if necessary; then I would display the SIN on top of it.

And, of course, because these are two-byte characters, recognising SIN is not as simple as "IFMID$(A$,I%,1)="!
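
The look-ahead described above might be sketched like this, comparing two bytes at a time rather than one. It assumes every character in the string is two bytes long, and it only reports what it finds rather than drawing anything; `dot1$`, `dot2$` and the test string are illustrative.

```bbcbasic
REM Sketch: pair each letter with a following U+05C1 / U+05C2 by looking ahead
dot1$ = CHR$&D7 + CHR$&81 : REM U+05C1
dot2$ = CHR$&D7 + CHR$&82 : REM U+05C2
REM Test string: HE (U+05D4), the three-pronged letter (U+05E9), U+05C1
A$ = CHR$&D7 + CHR$&94 + CHR$&D7 + CHR$&A9 + dot1$
I% = 1
WHILE I% < LEN(A$)
  N$ = MID$(A$, I%+2, 2) : REM the character after this one, if any
  IF N$ = dot1$ OR N$ = dot2$ THEN
    PRINT "Bytes " + STR$(I%) + "-" + STR$(I%+3) + ": letter plus dot"
    I% += 4
  ELSE
    PRINT "Bytes " + STR$(I%) + "-" + STR$(I%+1) + ": letter alone"
    I% += 2
  ENDIF
ENDWHILE
```

Because 215 can only ever be the first byte of a character, a two-byte comparison that starts on a character boundary cannot match across two characters.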

KenDown · Post by **KenDown** » Thu 21 Oct 2021, 03:52

Mind you, I suppose we can be grateful! Looking at "Character Map" I see that Arabic has 27 of these dotted circles with various dots, strokes and other shapes associated, both above and below the dotted circle. I suspect there must be some way of ignoring the dotted circle and just displaying the shape, so it's not as simple as just over-printing a character on top of the dotted circle character.

DDRM · Post by **DDRM** » Thu 21 Oct 2021, 08:19

Is there a common feature of all these codes (like being 215 xxx?!) that indicates "backspace and then overlay this modifier"? Can you do it that way? Then you don't need to "read in advance" - just spot that it is an overlay character and backstep when you print it?

KenDown · Post by **KenDown** » Thu 21 Oct 2021, 19:13

No, they are *all* 215 xxx, so you would have to read 215 169 (or whatever it is for SIN) and then read 215 129 and backspace - but that would print the dotted circle on top of the SIN. I haven't tried it, so perhaps the dotted circle is rendered as transparent or something. A major problem is that instead of printing out a string of characters, you would have to print each character in turn and move the cursor forward by whatever the width of that character is. The exception, of course, is the dotted circle character, in which case you print SIN, move forward by the width of SIN, read the next character, realise that it is the dot, move back by the width of SIN (not the width of the dotted circle), print the dotted circle, move forward by the width of SIN and so on.
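
The two dots share the lead byte 215 with the letters, but the decoded code point does tell them apart. Going by the Unicode character tables, the Hebrew points and accents that attach to the preceding letter occupy U+0591 to U+05BD plus U+05BF, U+05C1, U+05C2, U+05C4, U+05C5 and U+05C7, while the letters start at U+05D0. A minimal sketch of such a test follows; `FNhebrewmark` is a made-up name, not a library function.

```bbcbasic
REM Sketch: test whether a code point is a Hebrew mark that sits on the
REM preceding letter. FNhebrewmark is a hypothetical helper.
PRINT FNhebrewmark(&5C1) : REM the right-hand dot: prints -1 (TRUE)
PRINT FNhebrewmark(&5E9) : REM the three-pronged letter: prints 0 (FALSE)
END

DEF FNhebrewmark(U%)
LOCAL R%
IF U% >= &591 AND U% <= &5BD THEN R% = TRUE
IF U% = &5BF OR U% = &5C1 OR U% = &5C2 THEN R% = TRUE
IF U% = &5C4 OR U% = &5C5 OR U% = &5C7 THEN R% = TRUE
= R%
```

This only identifies the marks; it does nothing about positioning them, which is the harder half of the problem.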

I'm beginning to think that Jews and Arabs can keep their alphabets!

KenDown · Post by **KenDown** » Sat 23 Oct 2021, 03:54

Some time ago Richard published a function for printing Arabic (you can find it in his Wiki under "Communication and Input-Output") and it occurred to me to see if that might throw any light on the subject. It probably would if I were smart enough to work out what the heck is going on, but as far as I can see, he doesn't use any of the dotted circle characters in the Arabic alphabet.

However, using a few bits from his code, I have come up with the following, which neatly illustrates the problem to which I referred: the dotted circle overwrites; what we need is some way of extracting the bit which is not the dotted circle and using only that bit.

Code:

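      VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
      *FONT Times New Roman, 28
      PROChebrew("השׁמים")
      END

      DEF PROChebrew(A$) : LOCAL B$
      REM Visit every byte; act only on the lead byte &D7 that starts each character here
      FOR A% = !^A$ TO !^A$ + LEN(A$) - 1
        IF ?A% = &D7 THEN
          REM Combine the two bytes into the code point U%
          U% = ((?A% AND &3F) << 6) + (A%?1 AND &3F)
          PRINT ?A% ~U%
          CASE U% OF
            WHEN &5C1,&5C2: REM When it is one of the dotted circle characters
              B$ += CHR$8 : REM Backspace so that the dot overprints the letter
          ENDCASE
          B$ += CHR$(?A%) + CHR$(A%?1)
        ELSE
          U% = 0
        ENDIF
      NEXT
      VDU 23,16,2;0;0;0;13 : REM Select right-to-left printing
      VDU 5 : MOVE 400,500 : PRINT B$ : VDU 4
      PRINT TAB(3) B$
      VDU 23,16,0;0;0;0;13 : REM Select left-to-right printing
      ENDPROC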

Note that in VDU5 mode the dotted circle overwrites but what it has overwritten is still visible. In VDU4 mode the dotted circle hides the previous character.

Here is basically the same routine, but this time the dotted circle character is inserted *before* the character which it affects. In VDU4 mode the SIN overwrites and hides the dotted circle, though because a simple CHR$8 does not work correctly with proportional fonts, part of the dotted circle is still visible. In VDU5 mode (which is what my Display program requires) the dotted circle is still visible.

Code:

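      VDU 23,22,640;512;8,16,16,128+8 : REM Select UTF-8 mode
      *FONT Times New Roman, 28
      PROChebrew("השׁמים")
      END

      DEF PROChebrew(A$) : LOCAL B$
      REM Visit every byte; act only on the lead byte &D7 that starts each character here
      FOR A% = !^A$ TO !^A$ + LEN(A$) - 1
        IF ?A% = &D7 THEN
          REM Combine the two bytes into the code point U%
          U% = ((?A% AND &3F) << 6) + (A%?1 AND &3F)
          PRINT ?A% ~U%
          CASE U% OF
            WHEN &5C1,&5C2: REM When it is one of the dotted circle characters
              REM Put the dot and a backspace in front of the letter just added
              B$ = LEFT$(B$, LEN(B$)-2) + CHR$(?A%) + CHR$(A%?1) + CHR$8 + RIGHT$(B$, 2)
            OTHERWISE
              B$ += CHR$(?A%) + CHR$(A%?1)
          ENDCASE
        ELSE
          U% = 0
        ENDIF
      NEXT
      VDU 23,16,2;0;0;0;13 : REM Select right-to-left printing
      VDU 5 : MOVE 400,500 : PRINT B$ : VDU 4
      PRINT TAB(3) B$
      VDU 23,16,0;0;0;0;13 : REM Select left-to-right printing
      ENDPROC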

Note that if you double up on the CHR$8 the dot is in the wrong place (but a different wrong place) in VDU5 mode and is completely overwritten and vanishes in VDU4 mode.

Grrrrrrr.
