All problems with UTF-8 are highly overestimated. You simply need to give it a try. UTF-8 is actually beautiful! And very assembly friendly. It is a sequence of bytes after all.
Maybe. But it depends on what you use it for and where, because everything is relative. Yes, program source code really is compact and compatible with old compilers, since the first 128 characters are the same as in ASCII. But as soon as you include characters in the U+0080 to U+07FF range, they already take 2 bytes per character (the same as UTF-16). And characters in U+0800 to U+FFFF take 3 bytes per character, which is already worse than UTF-16.
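For illustration, one character from each range (the byte values follow from the UTF-8 definition; the labels are mine):
utf8Ascii DB 41h ; "A" (U+0041): 1 byte, same as ASCII
utf8Cyrillic DB 0D0h, 0AFh ; "Я" (U+042F): 2 bytes, the same as UTF-16
utf8Euro DB 0E2h, 82h, 0ACh ; "€" (U+20AC): 3 bytes, one more than UTF-16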
Thus the effectiveness of this encoding is determined by the character set you use. The price for it, a variable number of bytes per character, is a disadvantage, although of course also a relative one. And while UTF-8 is still convenient for storing and sending text (again, depending on the character set), using this encoding inside an editor is stupid, imho.
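On whole strings the same relativity shows up (the byte counts follow from the encodings; the labels and strings are mine):
szLatin DB "Hello!", 0 ; UTF-8: 6 bytes; UTF-16 would need 12
szCyrillic DB 0D0h,09Fh,0D1h,080h,0D0h,0B8h,0D0h,0B2h,0D0h,0B5h,0D1h,082h, 0 ; "Привет": 12 bytes in either encoding
szChinese DB 0E4h,0BDh,0A0h,0E5h,0A5h,0BDh, 0 ; "你好": 6 bytes in UTF-8, only 4 in UTF-16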
It is also stupid to use UTF-8 in an OS where UTF-16 is the native encoding.
UTF-8 (the common case, when the maximum string length is unknown):
setTextArrU PROC hdlg:HWND, pausz:PUSZ, nusz:ND, itxt:ID
LOCAL lusz: ND
LOCAL lwch: ND
LOCAL lwsz: ND
LOCAL pwsz: PWSZ
sendloop:
INVOKE lstrlenA, pausz
INC EAX ; + NULL
MOV lusz, EAX ; Length of utf8 stringz in bytes
; Get utf16 buffer length
INVOKE MultiByteToWideChar, CP_UTF8, NULL, pausz, lusz, NULL, NULL
TEST EAX, EAX ; Required utf16 buffer len. in chars
JZ exit ; =0: EPIC FAIL!!11
MOV lwch, EAX ; Length of utf16 buffer in chars
ADD EAX, EAX ; Chars->bytes
MOV lwsz, EAX ; Length of utf16 buffer in bytes
; Allocate utf16 buffer
INVOKE LocalAlloc, LMEM_FIXED, lwsz
TEST EAX, EAX ; Pointer?
JZ exit ; =0: EPIC FAIL!!11
MOV pwsz, EAX ; Pointer for utf16 stringz buffer
; Convert utf8->utf16
INVOKE MultiByteToWideChar, CP_UTF8, NULL, pausz, lusz, pwsz, lwch ; Last param is in chars, not bytes
; And only now set text
INVOKE SendDlgItemMessageW, hdlg, itxt, WM_SETTEXT, NULL, pwsz
; Free utf16 buffer
INVOKE LocalFree, pwsz
DEC nusz
JZ exit
INC itxt
MOV EAX, lusz
ADD pausz, EAX ; Next string
JMP sendloop
exit:
RET
setTextArrU ENDP
UTF-16 (no conversion needed):
setTextArrW PROC hdlg:HWND, pawsz:PWSZ, nwsz:ND, itxt:ID
sendloop:
INVOKE SendDlgItemMessageW, hdlg, itxt, WM_SETTEXT, NULL, pawsz
DEC nwsz
JZ exit
INC itxt
INVOKE lstrlenW, pawsz
INC EAX ; Skip 0
ADD EAX, EAX ; Words->Bytes
ADD pawsz, EAX ; Next string
JMP sendloop
exit:
RET
setTextArrW ENDP
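A hypothetical call of the procedures above (hDlg, aszLabels and IDC_FIRST are my names): the strings are packed back to back, and the dialog controls must have consecutive IDs, because itxt is simply incremented:
aszLabels DB "One", 0, "Two", 0, "Three", 0 ; three packed zero-terminated strings
; and somewhere in the dialog procedure:
INVOKE setTextArrU, hDlg, OFFSET aszLabels, 3, IDC_FIRST ; fills controls IDC_FIRST, +1, +2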
Yes, it is a sequence of variable-length characters (I would not want to write an editor for this encoding), and I don't think UTF-8 is any friendlier than a simple sequence of fixed-length characters.
It is possible that a good solution would be to create your own internal 4-byte format for convenient work with Unicode.
It has a name: UTF-32. That is the other Unicode encoding worth considering, if you really, really want a fixed-length character encoding. But in FreshLib I decided not to use it, simply because the effort of handling UTF-8 is lower than the memory wasted by UTF-32.
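To see the memory waste, take the same Cyrillic string in UTF-32 (the values are just the code points; the label is mine):
u32Privet DD 41Fh, 440h, 438h, 432h, 435h, 442h ; "Привет": 24 bytes vs 12 in UTF-8, but the n-th char is one indexed load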
I know that there is such an encoding. But I am talking about an internal format for the fastest work with large amounts of text. For that, a full conversion is not necessary. For example, your DecodeUtf16 procedure can be shortened to something like this:
proc DecodeUtf16, .chars
begin
mov eax, [.chars]
cmp ax, $d800
jb .direct
cmp ax, $dc00
jae .direct
; sub ax, $d800
; movzx ecx, ax
; shr eax, 16
; shl ecx, 10
; sub ax, $dc00
; or eax, ecx
mov edx, 4
jmp .decoded
; cmp eax, $110000
; jb .decoded
; xor eax, eax
; xor edx, edx
; stc
; return
.direct:
mov edx, 2
and eax, $ffff
.decoded:
clc
return
endp
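For example, a hypothetical caller (my sketch, not FreshLib code) needs only the size returned in edx to scan a string quickly, say to count its characters:
proc CountUtf16Chars, .pwsz
begin
push esi ecx
mov esi, [.pwsz]
xor ecx, ecx ; character counter
.next:
cmp word [esi], 0 ; terminating NULL?
je .done
stdcall DecodeUtf16, dword [esi] ; only edx (2 or 4 bytes) is used
add esi, edx ; advance past one character
inc ecx
jmp .next
.done:
mov eax, ecx
pop ecx esi
return
endp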
I'm just trying to imagine what is needed for the fastest and easiest processing of Unicode text.
This is because you use your library for programs that process relatively small amounts of text through an already existing API. For that, this approach is quite acceptable. But if you had to write a text editor or a browser without relying on an existing API, this approach would most likely not suit you.
Yesterday, on your advice, I tried using UTF-8 in the source code, and I had a lot of fun. Where were you before? It really works, even for such a product as Masm. Of course, before writing the converter I tried to find a solution with Unicode, but I only came across examples using macros, strings in resources, and character codes defined via EQU. I also found the MultiByteToWideChar procedure in the documentation, but the old documentation in the Masm package had no description of the UTF-8 code-page flag.
However, even if I had found a solution with this encoding, I would still have written a converter to support UTF-16, since a solution based on that encoding is almost perfect, imho.
By the way, for this test I divided the Unicode example into two subprojects, one with UTF-8 and one with UTF-16, and both were blessed by SkyNet.
SkyNet blessed the split into two projects, but noticed that two almost identical files are too non-kosher, so I merged them back into one, for which I had to slightly modify the converter. It is now a whole 8 kilobytes in size and converts to these formats: ASCII, UTF-8 (chars or hex bytes), UTF-16 hex words and UTF-32 hex dwords. It also supports overriding the format in the source with prefix letters that define the format as on the command line, but only in capital letters. For example:
AsciiChars CHAR A"This is ASCII string, in chars", 0
Utf8Chars CHAR C"This is UTF-8 string, in chars", 0
Utf8HexBytes CHAR B"This is UTF-8 string, in hex.bytes", 0
Utf16HexWords WCHAR W"This is UTF-16 string, in hex.words", 0 ; Instead of C's L"..." prefix
Utf32HexDwords DCHAR D"This is UTF-32 string, in hex.dwords", 0
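Presumably the hex-word output of such a line looks something like this (my guess at the exact formatting; the values are simply the UTF-16LE code units of the text):
Utf16Hi WCHAR 0048h, 0069h, 0021h, 0 ; converted from W"Hi!"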
Also the "IDE" batch file was rewritten, with which you can choose all these formats.
The next extension of the converter's capabilities: it can now save Unicode strings not only in Little Endian but also in Big Endian format. The overriding prefixes are now the same as the uo option parameters:
AsciiChars CHAR a"This is ASCII string, in chars", 0
Utf8Chars CHAR c"This is UTF-8 string, in chars", 0
Utf8HexBytes CHAR b"This is UTF-8 string, in hex.bytes", 0
Utf16LeHexWords WCHAR w"This is UTF-16LE string, in hex.words", 0
Utf16BeHexWords WCHAR W"This is UTF-16BE string, in hex.words", 0
Utf32LeHexDwords DCHAR d"This is UTF-32LE string, in hex.dwords", 0
Utf32BeHexDwords DCHAR D"This is UTF-32BE string, in hex.dwords", 0
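The difference is only the byte order within each code unit; for example, for "Я" (U+042F), straight from the definitions of LE and BE:
Utf16LeYa DB 2Fh, 04h ; little-endian: low byte first
Utf16BeYa DB 04h, 2Fh ; big-endian: high byte first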
The "IDE" on batch file was refined: now the example demonstrates the possibility of creating a cross-platform program in the native unicode encoding of the target OS (Windows and Linux) on MASM Assembler.
Of course, unlike JWASM and UASM, MASM cannot create object files in ELF format, so to get that format I used the very cool objconv utility by Agner Fog.
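The ELF step in the batch file is presumably something like this (my sketch; the file names are made up, but -felf32 is objconv's real switch for 32-bit ELF output):
ml /c /coff HelloUcW.asm
objconv -felf32 HelloUcW.obj HelloUcW.o
ld -m elf_i386 -o HelloUcW HelloUcW.o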
Here is a screenshot of the result of the HelloUcW example in the Linux console:
So now, when porting Fresh IDE to Windows, you too can avoid inserting UTF-8 to UTF-16 conversion code at every string output (if you use the converter, of course).
In general, it is now possible to easily create cross-platform code with the native Unicode encoding for Linux and Windows in MASM, and this undoubtedly speeds up the creation of a Russian Linux, and therefore accelerates the advent of SkyNet.
To swim or not to swim, that is the question...
Now the converter can work with utf-8 source.
Unicode UTF-8/16 To ASCII Assembly Source Converter v1.80
Coded in Masm32 by KyberMax (C:)TeraHard Labs 2015, 2019
Syntax
uasm2asm <command> [/|-<option> ...] <UnicodeSrc>[.ext] [AsciiSrc[.ext]]
Commands
? : Help
c : Convert
Options
16 : .RADIX 16 in source (drop h suffix)
f : Formatted hex.numbers (don't drop leading zeros)
pc <c> : Prefix of Comment directive (one char only)
ph <s> : Prefix of Hexadecimal numbers (two chars max)
rq : Replace string Quotation marks ' by "
uo <c> : Unicode string conversion Output format
a : Ascii
c : utf-8, in Chars
b : utf-8, in hex.Bytes
w : utf-16le, in hex.Words (default)
W : utf-16be, in hex.Words
d : utf-32le, in hex.Dwords
D : utf-32be, in hex.Dwords
us <c> : Unicode format of Source (if no BOM)
c : utf-8
w : utf-16le (default)
W : utf-16be
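Following this syntax, a conversion to the default UTF-16LE hex words would presumably be invoked like this (the file names are made up):
uasm2asm c /uo w HelloUcW HelloUcA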
Now the converter has been ported to Linux. It is assembled on Windows 7-64 with Masm32 and works exclusively through INT 80h, since SUDDENLY there were no Linux libraries in the Masm32 package.
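For reference, console output through INT 80h needs no libraries at all; a minimal sketch (syscall numbers per the 32-bit Linux ABI; msg and msgLen are my placeholders):
MOV EAX, 4 ; sys_write
MOV EBX, 1 ; stdout
MOV ECX, OFFSET msg ; pointer to the text
MOV EDX, msgLen ; length in bytes
INT 80h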
In general, it was a real antique adventure: what could be more fun than reworking libraries to support two operating systems? But now it is possible to build programs (so far only console ones) for different systems, selecting the target system with two keystrokes.
The converter was tested on a 32-bit Bionic Beaver running in a virtual machine on Windows 7-64.
It might let you drop a certain amount of code when porting Fresh to Windows.
Hi, johnfound, the cyborg has the Gospel for you: just look at this screenshot.
The source is based on the Hello example from the fasm pack.
; example of simplified Windows programming using complex macro features
include 'win32wx.inc' ; you can simply switch between win32ax, win32wx, win64ax and win64wx here
include 'encoding\utf8.inc' ; utf8 to utf16 conversion macros
CR = $D
LF = $A
.code
start:
invoke MessageBox, HWND_DESKTOP, uszHello, "Hello, Fasm Utf16 World!", MB_OK
invoke ExitProcess, 0
uszHello:
uszGreek du "Γεια σου κόσμε!", CR, LF
uszItalian du "Ciao, mondo!", CR, LF
uszFrench du "Bonjour le monde!", CR, LF
uszSpanish du "Hola Mundo!", CR, LF
uszEnglish du "Hello, World!", CR, LF
uszGerman du "Hallo Welt!", CR, LF
uszHindi du "नमस्ते दुनिया!", CR, LF
uszArabian du "!مرحبا بالعالم", CR, LF
uszHebrew du "!שלום עולם", CR, LF
uszChinese du "你好,世界!", CR, LF
uszJapanese du "こんにちは世界!", CR, LF
uszKorean du "안녕, 세상!", CR, LF
uszVietnamese du "Xin Chào, Thế Giới!", CR, LF
uszMongolian du "Сайн байна уу, Дэлхийн!", CR, LF
uszTurkish du "Selam Dünya!", CR, LF
uszAzerbaijani du "Salam, Dünya!", CR, LF
uszUzbek du "Salom Dunyo!", CR, LF
uszKazakh du "Сәлем Әлем!", CR, LF
uszTajik du "Салом, ҶАҲОН!", CR, LF
uszFinnish du "Hei maailma!", CR, LF
uszEstonian du "Tere, Maailm!", CR, LF
uszArmenian du "Բարեւ Աշխարհը!", CR, LF
uszGeorgian du "გამარჯობა მსოფლიო!", CR, LF
uszSerbian du "Здраво, Свет!", CR, LF
uszBulgarian du "Здравей, Свят!", CR, LF
uszCroatian du "Zdravo, svijete!", CR, LF
uszCzech du "Ahoj světe!", CR, LF
uszPolish du "Witaj, Świecie!", CR, LF
uszUkrainian du "Привіт, Світ!", CR, LF
uszBelarusian du "Прывітанне Сусвет!", CR, LF
uszRussian du "Дратути!", 0
.end start
I dug this macro out last week. So now you can use native Unicode in Windows without any converter or on-the-fly UTF-8 to UTF-16 recoding in the program.
It's funny that this macro was written back in 2013 (judging by the file date), yet for some reason no example based on it had been written until now (perhaps because the fasm IDE does not display Unicode).
Well, I know about this macro, but as I already said, I consider UTF-16 a very bad practice and think that working with UTF-8 (and partially with UTF-32, if you really, really need fixed-size characters) is much more elegant and correct.
The point here is not the fixed width of the characters, but the fact that UTF-16 is the native encoding in Windows. Therefore using this encoding is the most elegant and correct approach, as well as the simplest, because less code means fewer errors, doesn't it?
It is also very strange that you have not written about this macro here before.
Currently I am working on FreshLib, and it is not a Windows library but a portable one. And because I am trying to keep the OS-dependent layer as small as possible (in order to be easily portable to new systems), the library cannot use the native encoding of every OS.
In addition, most of the library code is OS independent, so conversion from/to the native encoding happens only when strings have to be passed to the OS API, for example to set a window title or similar. This does not affect performance much.
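A minimal sketch of that idea for a Windows backend (my names, not FreshLib's real API; a fixed-size local buffer is assumed for brevity):
proc OsSetWindowTitle, .hwnd, .putf8
.wbuf rw 256 ; fixed-size local buffer, enough for a title in this sketch
begin
; the UTF-8 -> UTF-16 conversion lives only here, in the OS-dependent layer;
; everything above this layer keeps working with UTF-8
lea eax, [.wbuf]
invoke MultiByteToWideChar, CP_UTF8, 0, [.putf8], -1, eax, 256
lea eax, [.wbuf]
invoke SetWindowTextW, [.hwnd], eax
return
endp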
I have never used this macro, for the reasons above.
I know this, but wouldn't using various Unicode macros (or the converter) shrink the OS-dependent layer?
Yes, I have seen the source code of your library. In my opinion this is not a matter of performance but of excess code, which also requires memory allocation. IMHO, it complicates things unnecessarily.
Well, that is not what I asked about.