▄▄             ▄▄▄  ▄▄▄ Power
█  █ ▄▄▄▄ ▄▄▄▄▄ █  █ █  █
█▄▄█ █▄▄▄ █ █ █ █▀▀▄ █▀▀▄
█  █ ▄▄▄█ █ █ █ █▄▄▀ █▄▄▀

Login
Register
/ aa about.it ad amd64 and.who api asm asmbb asmbb.features authentication bbcode best bugs bulma cares chat common debian decentralization deck design dll docker email embed fast feature files fossil fresh.ide friendly gamedev heap help hiawatha high.cpu i18n ideas incredible interop learning libfresh limit links linux mailing.list meme meta.http-equiv minimag money mysql neo nginx numbers orly os outage pass password post-by-email programmers programming proile read-by-email resources safety script.alert.xss secret seo skins sodom source sourcecode stress.test subdirectory subforum suggestion support tags templates test test123 theme type very.ugly video work xss игнат котики парола русский тест уеб.програмиране хабр.наполеон
Categories Threads

Error detected RSS

0 1 2
KyberMax

However, all this does not affect my opinion on UTF-8 encoding.

All problems with UTF-8 are highly overestimated. You simply need to give it a try. UTF-8 is actually beautiful! And very assembly friendly. It is a sequence of bytes after all.

johnfound

All problems with UTF-8 are highly overestimated.

Maybe. But it depends on what and where to use it - because everything is relative. Yes, the source code of the programs are really compact and compatible with old compilers, since the first 128 characters are the same as in ASCII. But if you include characters with an encoding of 80-7FF, then this will require already 2 bytes per char (as in UTF-16). And for characters 800-FFFF - already 3 bytes per char (and this is already worse than that of UTF-16).

Thus, the effectiveness of this encoding is determined by the set of characters used. But the price for this - variable number of bytes per character is a disadvantage, although, of course, also relative. And if it is still convenient to use for storing and sending text (still depending on the character set), then using this encoding in the editor is stupid, imho.

It is also stupid to use UTF-8 encoding in the OS, where UTF-16 is native.

johnfound

You simply need to give it a try.

UTF-8 (common case when max.string length is unknown):

setTextArrU PROC hdlg:HWND, pausz:PUSZ, nusz:ND, itxt:ID LOCAL lusz: ND LOCAL lwsz: ND LOCAL pwsz: PWSZ sendloop: INVOKE lstrlenA, pausz INC EAX ; + NULL MOV lusz, EAX ; Length of utf8 stringz in bytes ; Get utf16 buffer length INVOKE MultiByteToWideChar, CP_UTF8, NULL, pausz, lusz, NULL, NULL TEST EAX, EAX ; Requred utf16 buffer len.in chars JZ exit ; =0: EPIC FAIL!!11 ADD EAX, EAX ; Chars->bytes MOV lwsz, EAX ; Length of utf16 buffer in bytes ; Allocate utf16 buffer INVOKE LocalAlloc, LMEM_FIXED, lwsz TEST EAX, EAX ; Pointer? JZ exit ; =0: EPIC FAIL!!11 MOV pwsz, EAX ; Pointer for utf16 stringz buffer ; Convert utf8->utf16 INVOKE MultiByteToWideChar, CP_UTF8, NULL, pausz, lusz, pwsz, lwsz ; And only now set text INVOKE SendDlgItemMessageW, hdlg, itxt, WM_SETTEXT, NULL, pwsz ; Free utf16 buffer INVOKE LocalFree, pwsz DEC nusz JZ exit INC itxt MOV EAX, lusz ADD pausz, EAX ; Next string JMP sendloop exit: RET setTextArrU ENDP

UTF-16

setTextArrW PROC hdlg:HWND, pawsz:PWSZ, nwsz:ND, itxt:ID sendloop: INVOKE SendDlgItemMessageW, hdlg, itxt, WM_SETTEXT, NULL, pawsz DEC nwsz JZ exit INC itxt INVOKE lstrlenW, pawsz INC EAX ; Skip 0 ADD EAX, EAX ; Words->Bytes ADD pawsz, EAX ; Next string JMP sendloop exit: RET setTextArrW ENDP
johnfound

UTF-8 is actually beautiful! And very assembly friendly. It is a sequence of bytes after all.

Yes, a sequence of characters of variable length (I would not like to write an editor for this encoding) - I don’t think that UTF-8 is more friendly than a simple sequence of characters of constant length.

It is possible that a good solution would be to create your own internal 4-byte format for the convenience of working with Unicode.

KyberMax

It is possible that a good solution would be to create your own internal 4-byte format for the convenience of working with Unicode.

It has a name: UTF-32. This is the another Unicode encoding that worth. If you really, really want to have fixed length character encoding. But in FreshLib I decided not to use it. Simply because the effort handling UTF-8 is really lower than the memory waste by using UTF-32.

johnfound

It has a name: UTF-32.

I know that there is such a encoding. But I'm talking about the internal format for the most rapid work with large amounts of text. To do this, it is not necessary to do full conversion. For example, your DecodeUtf16 procedure can be shortened to this type:

proc DecodeUtf16, .chars begin mov eax, [.chars] cmp ax, $d800 jb .direct cmp ax, $dc00 jae .direct ; sub ax, $d800 ; movzx ecx, ax ; shr eax, 16 ; shl ecx, 10 ; sub ax, $dc00 ; or eax, ecx mov edx, 4 jmp .decoded ; cmp eax, $110000 ; jb .decoded ; xor eax, eax ; xor edx, edx ; stc ; return .direct: mov edx, 2 and eax, $ffff .decoded: clc return endp

I'm just trying to imagine what is needed for the fastest and easiest processing of Unicode text.

johnfound

But in FreshLib I decided not to use it. Simply because the effort handling UTF-8 is really lower than the memory waste by using UTF-32.

This is because your library is used by you for programs designed to process relatively small amounts of text using an already existing API. For this, this approach is quite acceptable. But if you had to write a text editor without an API or browser, this approach most likely would not suit you.

Yesterday, on your advice, I tried to use UTF-8 in the source code - I had a lot of fun. Where were you before? It really works even for such a product as Masm. Of course, before writing the converter, I tried to find a solution with Unicode, but I only came across examples using macros, strings in resources, and defining character codes via EQU. I also found the MultiByteToWideChar procedure in the documentation, but because of the old documentation and the Masm-package, there was no description of the UTF-8 transcoding flag.

However, even if I found a solution with this encoding, I would still write a converter to support UTF-16, since the solution with this encoding is almost perfect imho.

By the way, for this test I divided the example with unicode into two subprojects: with UTF-8 and with UTF-16, and both blessed by SkyNet.

Attached files:
FileSizeUploadedDownloadsMD5 hash
HelloUcW.7z11176 bytes05.06.201928de02f93c17e7de77cfbbe104010772d5

SkyNet, though blessed to split the project into two, but noticed that two almost identical files are too non-kosher, so I merged two projects into one again, for which I had to slightly modify the converter. Now it has a size of as much as 8 kilobytes and converts to formats: ASCII, UTF-8 (chars or hex.bytes), UTF-16 hex.words and UTF-32 hex.dwords. It also supports overriding formats in the source with the help of prefix letters defining the format as in the command line, but only in capital letters. For example:

AsciiChars CHAR A"This is ASCII string, in chars", 0 Utf8Chars CHAR C"This is UTF-8 string, in chars", 0 Utf8HexBytes CHAR B"This is UTF-8 string, in hex.bytes", 0 Utf16HexWords WCHAR W"This is UTF-16 string, in hex.words", 0 ; Instead of c "L" Utf32HexDwords DCHAR D"This is UTF-32 string, in hex.dwords", 0

Also the "IDE" batch file was rewritten, with which you can choose all these formats.

Attached files:
FileSizeUploadedDownloadsMD5 hash
HelloUcW.7z8538 bytes14.06.201922674c8fa88ec1a55688ccc3ab9c6c7afd
Uasm2asm.7z4167 bytes14.06.20192945ea2ca1282ac30e222d84fcfd8dd3b9

The next extension of the converter's capabilities: now it is possible to save Unicode strings not only in Little Ednian, but also in the Big Endian format. Overriding prefixes are now the same as the uo option parameters:

AsciiChars CHAR a"This is ASCII string, in chars", 0 Utf8Chars CHAR c"This is UTF-8 string, in chars", 0 Utf8HexBytes CHAR b"This is UTF-8 string, in hex.bytes", 0 Utf16LeHexWords WCHAR w"This is UTF-16LE string, in hex.words", 0 Utf16BeHexWords WCHAR W"This is UTF-16BE string, in hex.words", 0 Utf32LeHexDwords DCHAR d"This is UTF-32LE string, in hex.dwords", 0 Utf32BeHexDwords DCHAR D"This is UTF-32BE string, in hex.dwords", 0

The "IDE" on batch file was refined: now the example demonstrates the possibility of creating a cross-platform program in the native unicode encoding of the target OS (Windows and Linux) on MASM Assembler.

Of course, unlike JWASM and UASM, MASM cannot create object files in ELF format. So to get this format, I used the very cool utility objconv by Agner Fog.

Here is a screenshot of the result of the HelloUcW example in Linux console:

https://i.ibb.co/FwB5vNG/Hello-Uc-WLinux-Sky-Net.png

So now you can also not insert the utf-8 to utf-16 conversion code at every output of any string when porting Fresh IDE to Windows (of course, if you use the converter).

In general, it is now possible to easily create a cross-platform code with native unicode encoding for Linux and Windows on MASM, and this undoubtedly speeds up the creation of Russian Linux, and therefore accelerates the advent of SkyNet.

To swim or not to swim, that is the question...

https://i110.fastpic.ru/big/2019/0621/3e/540c463f433d3d5fe0f47e9455a4033e.jpg

Attached files:
FileSizeUploadedDownloadsMD5 hash
Uasm2asm.7z4248 bytes21.06.201923afae96adcedb97da739b6706b27f4b0c
HelloUcW.7z13959 bytes21.06.2019269c02ab8fc3e3a5e3e7e23639f8159968

Now the converter can work with utf-8 source.

Unicode UTF-8/16 To ASCII Assembly Source Converter v1.80 Coded in Masm32 by KyberMax (C:)TeraHard Labs 2015, 2019 Syntax uasm2asm <command> [/|-<option> ...] <UnicodeSrc>[.ext] [AsciiSrc[.ext]] Commands ? : Help c : Convert Options 16 : .RADIX 16 in source (drop h suffix) f : Formatted hex.numbers (don't drop leading zeros) pc <c> : Prefix of Comment directive (one char.only) ph <s> : Prefix of Hexadecimal numbers (two chars max.) rq : Replace string Quotation marks ' by " uo <c> : Unicode string convertion Output format a : Ascii c : utf-8, in Chars b : utf-8, in hex.Bytes w : utf-16le, in hex.Words (default) W : utf-16be, in hex.Words d : utf-32le, in hex.Dwords D : utf-32be, in hex.Dwords us <c> : Unicode format of Source (if no BOM) c : utf-8 w : utf-16le (default) W : utf-16be
Attached files:
FileSizeUploadedDownloadsMD5 hash
Uasm2asm.7z4621 bytes28.06.201919a4cf1f9e4408156c3cf3541ea14adc1e

Now the converter is ported to Linux. It is assembled on Windows 7-64 in Masm32 and works exclusively on INT 80h as SUDDENLY in the Masm32 package there were no Linux libraries.

In general, it was a real antique adventure: what could be more fun than remaking libraries into support mode for two operating systems? But now it is possible to build programs (so far only console programs) for different systems, choosing the necessary system by pressing two buttons.

The converter was tested on a 32-bit Bionic Beaver, which stands on a virtual machine, which stands on Windows 7-64.

Try to reduce a certain amount of code when porting Fresh to Windows.

https://i.ibb.co/HpJSMz6/u2asknt.png

Attached files:
FileSizeUploadedDownloadsMD5 hash
Uasm2asm.7z9522 bytes02.07.2019208f15b4d6d5fef5ab2ef2a08397eb38ab

Hi, johnfound, the cyborg has the Gospel for you: just look at this screenshoot.

https://i89.fastpic.ru/big/2019/0719/f6/7cdc7ccadabe98b45b6aed3d310ee4f6.png

The source is based on Hello example from fasm pack.

; example of simplified Windows programming using complex macro features include 'win32wx.inc' ; you can simply switch between win32ax, win32wx, win64ax and win64wx here include 'encoding\utf8.inc' ; utf8 to utf16 conversion macros CR = $D LF = $A .code start: invoke MessageBox, HWND_DESKTOP, uszHello, "Hello, Fasm Utf16 World!", MB_OK invoke ExitProcess, CR, LF uszHello: uszGreek du "Γεια σου κόσμε!", CR, LF uszItalian du "Ciao, mondo!", CR, LF uszFrench du "Bonjour le monde!", CR, LF uszSpanish du "Hola Mundo!", CR, LF uszEnglish du "Hello, World!", CR, LF uszGerman du "Hallo Welt!", CR, LF uszHindi du "नमस्ते दुनिया!", CR, LF uszArabian du "!مرحبا بالعالم", CR, LF uszHebrew du "!שלום עולם", CR, LF uszChinese du "你好,世界!", CR, LF uszJapanese du "こんにちは世界!", CR, LF uszKorean du "안녕, 세상!", CR, LF uszVietnamese du "Xin Chào, Thế Giới!", CR, LF uszMongolian du "Сайн байна уу, Дэлхийн!", CR, LF uszTurkish du "Selam Dünya!", CR, LF uszAzerbaijani du "Salam, Dünya!", CR, LF uszUzbek du "Salom Dunyo!", CR, LF uszKazakh du "Сәлем Әлем!", CR, LF uszTajik du "Салом, ҶАҲОН!", CR, LF uszFinnish du "Height maailma!", CR, LF uszEstonian du "Tere, Maailm!", CR, LF uszArmenian du "Բարեւ Աշխարհը!", CR, LF uszGeorgian du "გამარჯობა მსოფლიო!", CR, LF uszSerbian du "Здраво, Свет!", CR, LF uszBulgarian du "Здравей, Свят!", CR, LF uszCroatian du "Zdravo, svijete!", CR, LF uszCzech du "Ahoj světe!", CR, LF uszPolish du "Witaj, Świecie!", CR, LF uszUkrainian du "Привіт, Світ!", CR, LF uszBelarusian du "Прывітанне Сусвет!", CR, LF uszRussian du "Дратути!", 0 .end start

I dug out this macro last week. So now you can use native unicode in Windows without any converter or recoding utf-8 to utf-16 on the fly in the program.

It's funny that this macro was written back in 2013 (judging by the file date), but for some reason it hasn’t written an example based on it yet (perhaps because fasm ide does not display unicode).

KyberMax

Hi, johnfound, the cyborg has the Gospel for you: just look at this screenshoot.

Well, I know about this macro, but as I already said, I am considering UTF-16 as a very bad practice and think that working with UTF-8 (and partially with UTF-32 if you really, really need fixed size characters) is much more elegant and correct.

johnfound

...I am considering UTF-16 as a very bad practice and think that working with UTF-8 (and partially with UTF-32 if you really, really need fixed size characters) is much more elegant and correct.

The point here is not a fixed width of characters, but the fact that utf-16 is a native encoding in Windows. Therefore, the use of this encoding is the most elegant and correct (as well as the simplest - because the less code, the fewer errors, is not it?)

johnfound

Well, I know about this macro...

This is also very strange, why have you not written here about this macro before?

KyberMax

The point here is not a fixed width of characters, but the fact that utf-16 is a native encoding in Windows. Therefore, the use of this encoding is the most elegant and correct (as well as the simplest - because the less code, the fewer errors, is not it?)

Currently I am working on FreshLib and it is not Windows library, but portable. And because I am trying to keep the OS dependent layer as small as possible (in order to be easy portable to new systems), the library can't use the native encoding for every OS.

In addition, most of the code of the library is OS independent, so the conversions from/to the native encoding is only when there is a need to pass the strings to the OS API. For example in order to set the window title, or similar. Which does not affect the performance so much.

KyberMax

This is also very strange, why have you not written here about this macro before?

I have never used this macro. Because of the above reasons.

johnfound

...I am working on FreshLib and it is not Windows library, but portable. And because I am trying to keep the OS dependent layer as small as possible (in order to be easy portable to new systems), the library can't use the native encoding for every OS.

I know this, but wouldn't the use of different Unicode macros (or converter) reduce the dependent level?

johnfound

...the conversions from/to the native encoding is only when there is a need to pass the strings to the OS API. For example in order to set the window title, or similar. Which does not affect the performance so much.

Yes, I saw the source code of your library. In my opinion, this is not a matter of performance, but of excess code, which also requires memory allocation. IMHO, this unnecessarily complicates the matter.

johnfound

I have never used this macro...

Well, I asked not about that.

0 1 2
Categories Threads

Error detected RSS

AsmBB v2.7 (check-in: b1b34acbf71dada0); SQLite v3.29.0 (check-in: fc82b73eaac8b369);

©2016..2018 John Found; Licensed under EUPL.
Powered by Assembly language
Created with Fresh IDE

Icons are made by Egor Rumyantsev, vaadin and icomoon from www.flaticon.com