China DOS Union

-- Unite DOS · Advance DOS · Grow DOS --

Union site: www.cn-dos.net Forum site: www.cn-dos.net/forum
DOS stands for freedom, openness and progress. Let us work hard, learn from the openness and GNU spirit of FreeDOS and Linux, and together build and grow a free GNU GPL world!

China DOS Union Forum » DOS Batch & Scripting Techniques (Batch Room) » [Original] GBK & UTF8 Encoding Conversion Script (CMD + GAWK)
Original Poster Posted 2006-11-30 00:31 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
GBK & UTF8 Encoding Conversion Script (CMD+GAWK)

I needed UTF8-to-GBK conversion in my own work, so I wrote a converter in GAWK; an earlier post of mine already used it. This time I tidied it up and added conversion in both directions. I built a complete GBK-to-UTF8 mapping table myself. Along the way I found quite a few differences between the system's conversion results and iconv's; the mapping table follows the system's results.
The script converts input from a pipe or from files and writes the result to the screen. It supports only a few parameters so far, but it validates them and accepts multiple parameters in any order. It also uses a new way of extracting the embedded script: whenever the source file is modified, the extracted script is regenerated automatically. Error messages and a dependency-file integrity check are included as well. I hope these little tricks help everyone write batch files.
GAWK download link: http://www.klabaster.com/progs/gawk32.zip
The script and conversion table are in the attachment.
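
The automatic update mentioned above appears to rest on the cache file name: the listing stores the extracted GAWK program as %~n0%~z0.awk, so the name embeds the batch file's size, and a changed size forces a fresh extraction. A minimal Python sketch of the same idea follows. The names `extract_embedded` and `cache_dir` are hypothetical and used only for illustration; the `#<-1` and `#>-1` markers are the ones in the listing.

```python
import os
import tempfile


def extract_embedded(script_path, cache_dir=None):
    """Return the path of a cached copy of the block between '#<-1' and '#>-1'."""
    cache_dir = cache_dir or tempfile.gettempdir()
    stem = os.path.splitext(os.path.basename(script_path))[0]
    size = os.path.getsize(script_path)
    # The cache name embeds the script size, like %~n0%~z0.awk in the batch file.
    cache_path = os.path.join(cache_dir, f"{stem}{size}.awk")
    if not os.path.exists(cache_path):
        # Drop stale copies left behind by earlier versions of the script.
        for name in os.listdir(cache_dir):
            if name.startswith(stem) and name.endswith(".awk"):
                os.remove(os.path.join(cache_dir, name))
        inside = False
        kept = []
        with open(script_path, encoding="latin-1") as src:
            for line in src:
                if line.startswith("#<-1"):
                    inside = True
                # Keep only non-comment lines inside the marked range.
                if inside and not line.startswith("#"):
                    kept.append(line)
                if line.startswith("#>-1"):
                    inside = False
        with open(cache_path, "w", encoding="latin-1") as out:
            out.writelines(kept)
    return cache_path
```

A size-keyed cache misses edits that leave the file size unchanged; hashing the file contents would catch those, at the cost of reading the whole file on every run.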


:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: gbk2utf8.cmd -V0.1 -- GBK & UTF8 encoding conversion
:: 无奈何@cn-dos.net - 2006-11-28 - CMD & GAWK
:: Usage: gbk2utf8 /I file...
:: Required files: gawk.exe gbk2utf8.dat
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
@echo off
setlocal
set self="%~f0"
set AwkScript="%temp%\%~n0%~z0.awk"
set path=%path%;%~dp0;%cd%
set nofile=
set error=
set input=

:: Dependency file integrity check
for %%i in (gawk.exe gbk2utf8.dat) do (
    @if "%%~$PATH:i" == "" (
        echo.The required dependency file "%%i" is missing.
        set nofile=1
    ) else ( set %%~ni="%%~$PATH:i" )
)
if defined nofile goto :EOF
:: Re-extract the GAWK script whenever this file has changed
if not exist %AwkScript% (
    del /q "%temp%\%~n0*.awk" 2>nul
    gawk "/^#<-1/,/^#>-1/{if(!/^#/)print}" %self% >%AwkScript%
)

:ParseLoop
if "%~1" == "" goto Start
if "%~1" == "?" goto SwitchH
if "%~1" == "/?" goto SwitchH
rem Process switches and jump to the matching label.
for %%s in (U u I i h H) do if "%~1"=="/%%s" goto Switch%%s
if "%F_input%" == "1" (
    if not exist "%~1" set error=Warning: file "%~1" does not exist. & goto error
    set input=%input% "%~1"
    shift
    goto ParseLoop
)
if "%F_input%" == "-1" shift & goto ParseLoop
set error=Error: incorrect parameter format - "%1" !
goto error

:SwitchI
set F_input=1
if "%~2" == "-" set F_input=-1
shift
goto ParseLoop

:SwitchU
set F=-1
shift
goto ParseLoop

:error
echo.%error%
echo.
:SwitchH
echo.gbk2utf8 V0.1 -- GBK ^& UTF8 encoding conversion
echo.
echo.Usage:   1. %~n0
echo.         2. %~n0 /I file...
echo.         3. %~n0 /I -
echo.
echo.Options: /?  Display this brief help; same as /H.
echo.         /U  Convert UTF8 to GBK; the default is GBK to UTF8.
echo.         /I  Specify the files to convert; "-" reads from standard input.
echo.             May be omitted, in which case input is read from standard input.
echo.             Required whenever a file name is given.
goto :EOF

:Start
if "%input%" == "" set F_input=-1
if "%F_input%" == "-1" (
    gawk -v F=%F% -f %AwkScript%
) else (
    gawk -v F=%F% -f %AwkScript% %input%
)
goto :EOF

:AwkScript
#<-1
# NOTE: the forum software stripped every [...] from the posted listing.
# The charset[...] subscripts below are restored from context. The bracket
# expressions in the two patterns are an ASSUMPTION reconstructed from the
# GBK and UTF-8 byte layouts; check them against the attached original.
function gbk2utf8(string, flag,    reg, gbkreg, utf8reg, char, result) {
    gbkreg = "[\x81-\xfe][\x40-\xfe]|[\x01-\x7f]"
    utf8reg = "[\x01-\x7f]|[\xc2-\xdf][\x80-\xbf]|\xe0[\xa0-\xbf][\x80-\xbf]|[\xe1-\xef][\x80-\xbf][\x80-\xbf]|\xf0[\x90-\xbf][\x80-\xbf][\x80-\xbf]|[\xf1-\xf7][\x80-\xbf][\x80-\xbf][\x80-\xbf]"
    reg = gbkreg
    if (flag == -1)
        reg = utf8reg
    RLENGTH = 1
    while (RLENGTH != -1) {
        match(string, reg)
        char = substr(string, RSTART, RLENGTH)
        # Multi-byte characters go through the mapping table;
        # single bytes pass through unchanged.
        if (RLENGTH > 1)
            char = charset[char]
        result = result char
        string = substr(string, RSTART + RLENGTH)
    }
    return result
}

BEGIN {
    FS = ","
    if (!F) F = 1
    if (F == 1) {
        # GBK to UTF8: key on column 1, value from column 2
        while ((getline < "gbk2utf8.dat") > 0)
            charset[$1] = $2
    }
    else {
        # UTF8 to GBK: key on column 2, value from column 1
        while ((getline < "gbk2utf8.dat") > 0)
            charset[$2] = $1
    }
    close("gbk2utf8.dat")
}
{
    x = gbk2utf8($0, F)
    print x
}
#>-1
goto :EOF
Posted by 无奈何 2006-11-30 01:02


[ Last edited by 无奈何 on 2006-11-30 at 02:04 PM ]
Attachments
gbk2utf8.zip (102.8 KiB, Credits to download 1 pts, Downloads: 259)
  ☆Start\Run (WIN+R)☆
%ComSpec% /cset,=何奈无── 。何奈可无是原,事奈无做人奈无&for,/l,%i,in,(22,-1,0)do,@call,set/p= %,:~%i,1%<nul&ping/n 1 127.1>nul

Floor 2 Posted 2006-11-30 00:38 · Sanhe, Langfang, Hebei, China (China Mobile)
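Great post, leaving my mark here. (electronixtar, please forgive me for borrowing your line ^_^) 无奈何 is a mountain I will never manage to climb over!
When three walk together, one of them is bound to be my teacher. Only by learning do we see our shortcomings; only by teaching do we see our difficulties; and then we can strengthen ourselves.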
Floor 3 Posted 2006-11-30 00:42 · Bobai County, Yulin, Guangxi, China (China Telecom)
Some time ago I found a GBK-Unicode mapping table online (a little over 7,000 characters) and used a batch script to compute a GBK to UTF-8 mapping table from it. With a dictionary-style lookup using for, the UTF-8 encoding can be extracted instantly. Let me look for it...
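
Deriving a GBK to UTF-8 table from a GBK to Unicode table, as described here, comes down to encoding each Unicode code point in UTF-8's shortest form. A rough Python sketch under that assumption follows. The function name `utf8_from_codepoint` and the two sample rows are illustrative and not taken from the poster's table (A1A7 follows the C2A8 entry quoted later in the thread; B0A1 is the GBK code of U+554A).

```python
def utf8_from_codepoint(cp):
    """Encode one Unicode code point as UTF-8 using the shortest form."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Hypothetical two-row excerpt of a GBK -> Unicode table (GBK code, code point).
gbk_to_unicode = {0xB0A1: 0x554A, 0xA1A7: 0x00A8}

gbk_to_utf8 = {gbk: utf8_from_codepoint(cp) for gbk, cp in gbk_to_unicode.items()}
for gbk, utf8 in sorted(gbk_to_utf8.items()):
    print(f"{gbk:04X} {utf8.hex().upper()}")
```

Choosing the byte count from the size of the code point matters: code points below U+0800 need two bytes, and padding them to three produces sequences that strict UTF-8 decoders reject.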
Floor 4 Posted 2006-11-30 00:46 · Hezuo, Gannan Tibetan Autonomous Prefecture, Gansu, China (China Telecom)
Hey, I still don't understand that gawk part. I'm learning it.
Floor 5 Posted 2006-11-30 00:48 · Beijing, China (China Unicom)
So wonderful~~~
    Redtek, a man forever wandering the net...

_.,-*~'`^`'~*-,.__.,-*~'`^`'~*-,._,_.,-*~'`^`'~*-,._,_.,-*~'`^`'~*-,._
Floor 6 Posted 2006-11-30 00:51 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
ccwan, you are too modest; I am no expert either.
zxcv, the 7,000-odd entries cover only GB2312. The complete GBK set has more than 20,000: 22,046 code points in total.

Floor 7 Posted 2006-11-30 00:58 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
RE vkill
GAWK supports multi-byte encodings. The advantage of this approach is that once you write a regular expression that matches one character of an encoding, the same code can be reused to convert other encodings. It should be fairly easy to follow: all it has to do is find each character's length and cut that character out of the string.
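
The approach described here can be sketched in Python on raw bytes: a pattern that matches exactly one character of the source encoding, a loop that cuts each match out, and a table lookup for the multi-byte ones. This is only an illustration. The byte ranges are assumed from the GBK layout rather than copied from the script, and `convert` plus the one-entry `table` are hypothetical stand-ins for the GAWK function and gbk2utf8.dat.

```python
import re

# One GBK character: a lead byte 0x81-0xFE followed by a trail byte 0x40-0xFE,
# or a single ASCII byte. These ranges are an assumption based on the GBK layout.
GBK_CHAR = re.compile(rb"[\x81-\xfe][\x40-\xfe]|[\x00-\x7f]")


def convert(data, table):
    """Translate each multi-byte match through table; copy single bytes as-is."""
    out = bytearray()
    for match in GBK_CHAR.finditer(data):
        char = match.group()
        if len(char) > 1:
            # Unmapped characters become empty, as an awk array lookup would.
            char = table.get(char, b"")
        out += char
    return bytes(out)


# Tiny stand-in for the mapping table: GBK B0A1 -> UTF-8 E5 95 8A (U+554A).
table = {b"\xb0\xa1": b"\xe5\x95\x8a"}
print(convert(b"a\xb0\xa1b", table))
```

Converting in the other direction only needs a different pattern (one that matches a single UTF-8 sequence) and the table with its columns swapped, which is the reuse the post points out.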

Floor 8 Posted 2006-11-30 01:52 · Bobai County, Yulin, Guangxi, China (China Telecom)
Originally posted by 无奈何 at 2006-11-29 12:51:
ccwan, you are too modest; I am no expert either.
zxcv, the 7,000-odd entries cover only GB2312. The complete GBK set has more than 20,000: 22,046 code points in total.

Indeed, it seems I need to find a complete GBK to UTF-8 conversion table.
Floor 9 Posted 2006-11-30 02:21 · Bobai County, Yulin, Guangxi, China (China Telecom)
I have obtained the complete gbk2utf8 encoding table.

Many of its entries differ from the GB table I converted myself.
For example:
gbk2utf8
A1A42C C2B7
A1A52C CB89
A1A62C CB87
A1A72C C2A8

The part shown in red in the original post is the GBK code.

gb2utf8
a1a4 E383BB
a1a5 E08B89
a1a6 E08B87
a1a7 E082A8


It seems that some of the mappings in the GB part of this gbk2utf8 table are wrong; so far I have found no errors in the codes outside the GB part.
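
One way to compare the two listings is to decode each UTF-8 sequence to its code point without rejecting overlong forms. The sketch below does that for the four rows quoted above (with the 0x2C separator left out). `decode_lenient` is a hypothetical helper, and the reading it suggests is an interpretation rather than something the posters state: three of the rows appear to name the same code point, with the gb2utf8 column padded to a three-byte form that strict UTF-8 decoders reject, while A1A4 maps to a different character in each table.

```python
def decode_lenient(seq):
    """Return the code point of one UTF-8 sequence, accepting overlong forms."""
    first = seq[0]
    if first < 0x80:
        cp, extra = first, 0
    elif first >> 5 == 0b110:
        cp, extra = first & 0x1F, 1
    elif first >> 4 == 0b1110:
        cp, extra = first & 0x0F, 2
    elif first >> 3 == 0b11110:
        cp, extra = first & 0x07, 3
    else:
        raise ValueError("not a UTF-8 lead byte")
    if len(seq) != extra + 1:
        raise ValueError("wrong sequence length")
    for byte in seq[1:]:
        cp = (cp << 6) | (byte & 0x3F)
    return cp


# (GBK code, gbk2utf8 column, gb2utf8 column) as quoted in the post.
rows = [
    ("A1A4", "C2B7", "E383BB"),
    ("A1A5", "CB89", "E08B89"),
    ("A1A6", "CB87", "E08B87"),
    ("A1A7", "C2A8", "E082A8"),
]

for gbk, left, right in rows:
    cp_left = decode_lenient(bytes.fromhex(left))
    cp_right = decode_lenient(bytes.fromhex(right))
    verdict = "same code point" if cp_left == cp_right else "different code points"
    print(f"{gbk}: U+{cp_left:04X} vs U+{cp_right:04X} ({verdict})")
```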
Floor 10 Posted 2006-11-30 03:45 · Ningbo, Zhejiang, China (Dr. Peng Broadband)
RE zxcv

My table has a separator "," (0x2C) inserted in the middle.

Here is an AWK script I wrote that generates every GBK character. Save its output as UTF8 and you get a matching table for the two encodings.


BEGIN{
    #GBK/1: A1A1-A9FE Graphic symbol area (GB 2312 non-Chinese-character symbol area)
    for (i=0xa1;i<=0xa9;i++)
        for (j=0xa1;j<=0xfe;j++)
            if (j!=0x7f) printf("%c%c\n",i,j)
    #GBK/2: B0A1-F7FE Chinese character area (GB 2312 Chinese character area)
    for (i=0xb0;i<=0xf7;i++)
        for (j=0xa1;j<=0xfe;j++)
            if (j!=0x7f) printf("%c%c\n",i,j)
    #GBK/3: 8140-A0FE Chinese character area
    for (i=0x81;i<=0xa0;i++)
        for (j=0x40;j<=0xfe;j++)
            if (j!=0x7f) printf("%c%c\n",i,j)
    #GBK/4: AA40-FEA0 Chinese character area
    for (i=0xaa;i<=0xfe;i++)
        for (j=0x40;j<=0xa0;j++)
            if (j!=0x7f) printf("%c%c\n",i,j)
    #GBK/5: A840-A9A0 Graphic symbol area
    for (i=0xa8;i<=0xa9;i++)
        for (j=0x40;j<=0xa0;j++)
            if (j!=0x7f) printf("%c%c\n",i,j)
}
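
As a cross-check, the five ranges in the script above can be enumerated in Python and counted. This is a sketch, not the author's code: the `RANGES` table and `gbk_codes` are names made up for the illustration, and the bounds are copied from the AWK loops.

```python
# (area, first row, last row, first column, last column), copied from the AWK loops.
RANGES = [
    ("GBK/1", 0xA1, 0xA9, 0xA1, 0xFE),
    ("GBK/2", 0xB0, 0xF7, 0xA1, 0xFE),
    ("GBK/3", 0x81, 0xA0, 0x40, 0xFE),
    ("GBK/4", 0xAA, 0xFE, 0x40, 0xA0),
    ("GBK/5", 0xA8, 0xA9, 0x40, 0xA0),
]


def gbk_codes():
    """Yield every two-byte code in the five areas, skipping trail byte 0x7F."""
    for _area, row_lo, row_hi, col_lo, col_hi in RANGES:
        for row in range(row_lo, row_hi + 1):
            for col in range(col_lo, col_hi + 1):
                if col != 0x7F:
                    yield bytes([row, col])


codes = list(gbk_codes())
print(len(codes))  # 22046, the total quoted earlier in the thread
```

The count agrees with the 22,046 code points mentioned on Floor 6, which suggests that figure refers to the code positions these loops cover rather than to the number of characters a particular codec defines.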



[ Last edited by 无奈何 on 2006-11-30 at 03:50 AM ]

Floor 11 Posted 2006-11-30 04:23 · Bobai County, Yulin, Guangxi, China (China Telecom)
In the GB part, only the following codes have no matching entry:

A8BB C991
A8BD C584
A8BE C588
A8C0 C9A1


I'll try it out first ^_^
Floor 12 Posted 2006-11-30 07:01 · Chengdu, Sichuan, China (Education Network)
I didn't understand it, so I'll give it a thumbs up first

C:\>BLOG http://initiative.yo2.cn/
C:\>hh.exe ntcmds.chm::/ntcmds.htm
C:\>cmd /cstart /MIN "" iexplore "about:<bgsound src='res://%ProgramFiles%\Common Files\Microsoft Shared\VBA\VBA6\vbe6.dll/10/5432'>"
Floor 13 Posted 2006-12-03 02:56 · Dongguan, Guangdong, China (China Telecom)
Too profound. Learning.
Floor 14 Posted 2006-12-03 02:59 · Chaoyang District, Beijing, China (China Unicom)
Looks like C++

Know yourself, master yourself, change yourself; only then can you change others!
Floor 15 Posted 2006-12-03 09:31 · Dongcheng District, Beijing, China (China Unicom)
GAWK is really powerful~~~