China DOS Union

-- Unite DOS · Advance DOS · Grow DOS --

Union site: www.cn-dos.net Forum site: www.cn-dos.net/forum
DOS stands for freedom, openness and progress. Let us work hard, learn from the openness and GNU spirit of FreeDOS and Linux, and together build and grow a free GNU GPL world!

China DOS Union Forum » Weblog (Blog) » National Standard GB18030-2005 "Information Technology - Chinese Coded Character Set"
Floor 61 Posted 2016-06-26 19:55 · China, Hainan, Haikou, China Telecom
Super Moderator
★★★★
Credits 3,673
Posts 2,020
Joined 2016-02-01 00:00
10-year member
UID 181465
Gender Male
Status Offline
Displaying the HZK16 Chinese character font in C language

http://www.cnblogs.com/hoodlum1980/articles/1079944.html




Reading HZK16 in C language

{CODE(wrap="1",colors="c")}
/**
 * HZK Chinese character dot matrix, thanks to netizen Gao Jinshan for selfless sharing
 *
 * Organized by: http://jdgcs.org/
 * Technical details: http://jdgcs.org/HZK16
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    // The Chinese character to be read. This source file must be saved in
    // GB2312 (GB) encoding so that the literal is exactly two bytes.
    unsigned char incode[3] = "我";
    unsigned char qh = 0, wh = 0;
    unsigned long offset = 0;
    unsigned char mat[16][2] = {{0}}; // 16 rows, 2 bytes (16 pixels) per row
    FILE *HZK = NULL;
    int i, j, k;

    // Each Chinese character occupies two bytes; get its area number and position number
    qh = incode[0] - 0xa0; // Area number comes from the first byte
    wh = incode[1] - 0xa0; // Position number comes from the second byte
    // 94 characters per area, 32 bytes per glyph; unsigned long avoids overflow with 16-bit int
    offset = (94UL * (qh - 1) + (wh - 1)) * 32;
    if ((HZK = fopen("hzk16", "rb")) == NULL)
    {
        printf("Can't Open hzk16\n");
        getchar();
        return EXIT_FAILURE;
    }
    fseek(HZK, (long)offset, SEEK_SET);
    if (fread(mat, 32, 1, HZK) != 1)
    {
        printf("Can't Read hzk16\n");
        fclose(HZK);
        return EXIT_FAILURE;
    }
    fclose(HZK);

    // Display
    for (i = 0; i < 16; i++)
    {
        for (j = 0; j < 2; j++)
        {
            for (k = 0; k < 8; k++)
            {
                if (mat[i][j] & (0x80 >> k))
                {   // If the tested bit is 1, display '#'
                    printf("%c", '#');
                }
                else
                {
                    printf("%c", '-');
                }
            }
        }
        printf("\n");
    }
    getchar();
    return EXIT_SUCCESS;
}
{CODE}
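The two steps in the listing (locating the glyph, then testing each bit) can be separated from the file I/O. The following is only a sketch under that reading of the code: the function names `hzk16_offset` and `hzk16_render` are made up for illustration and do not come from the quoted articles, and the glyph is assumed to be 16 rows of 2 bytes with the most significant bit leftmost, as the listing implies.

```c
#include <assert.h>
#include <stddef.h>

/* Byte offset of a character's 32-byte glyph inside an HZK16 file.
   hi and lo are the two bytes of the GB2312 code (hypothetical helper). */
long hzk16_offset(unsigned char hi, unsigned char lo)
{
    assert(hi >= 0xA1 && lo >= 0xA1);
    int qh = hi - 0xA0;   /* area number, 1-based */
    int wh = lo - 0xA0;   /* position number, 1-based */
    return (94L * (qh - 1) + (wh - 1)) * 32L;
}

/* Render a 16x16 glyph into out as 16 lines of 16 characters,
   each line followed by '\n'. out must hold 16 * 17 + 1 bytes. */
void hzk16_render(const unsigned char glyph[32], char *out)
{
    assert(glyph != NULL && out != NULL);
    size_t n = 0;
    for (int row = 0; row < 16; row++) {
        for (int col = 0; col < 2; col++) {
            unsigned char bits = glyph[row * 2 + col];
            for (int k = 0; k < 8; k++)
                out[n++] = (bits & (0x80 >> k)) ? '#' : '-';
        }
        out[n++] = '\n';
    }
    out[n] = '\0';
}
```

A caller would seek to `hzk16_offset(code[0], code[1])`, read 32 bytes into a buffer, and pass that buffer to `hzk16_render` before printing the result.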


{CODE(caption="C language reading HZK16, reference version",wrap="1",colors="c")}
// HZK Chinese character dot matrix, thanks to netizen hfhrman for selfless sharing
// Organized and released by http://jdgcs.org
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i, j, k;
    unsigned char incode[3] = "我"; // The Chinese character to be read (source file saved as GB2312)
    unsigned char qh, wh;
    unsigned long offset;
    unsigned char mat[16][2];
    FILE *HZK;

    // The character occupies two bytes; get its area number and position number
    qh = incode[0] - 0xa0; // Get the area number
    wh = incode[1] - 0xa0; // Get the position number
    offset = (94UL * (qh - 1) + (wh - 1)) * 32; /* Get the offset position */

    if ((HZK = fopen("hzk16", "rb")) == NULL)
    {
        printf("Can't Open hzk16\n");
        exit(EXIT_FAILURE);
    }
    fseek(HZK, (long)offset, SEEK_SET);
    if (fread(mat, 32, 1, HZK) != 1)
    {
        printf("Can't Read hzk16\n");
        fclose(HZK);
        exit(EXIT_FAILURE);
    }
    fclose(HZK);

    // Display
    for (j = 0; j < 16; j++)
    {
        for (i = 0; i < 2; i++)
            for (k = 0; k < 8; k++)
                if (mat[j][i] & (0x80 >> k)) /* If the tested bit is 1, display '#' */
                    printf("%c", '#');
                else
                    printf("%c", '-');
        printf("\n"); // end of one pixel row
    }
    return 0;
}
{CODE}

[ Last edited by zzz19760225 on 2016-12-11 at 22:39 ]
1<word>, 2, 3/paragraph\, 4{section}, 5(chapter).
Floor 62 Posted 2016-06-26 19:56 · China, Hainan, Haikou, China Telecom
https://baike.baidu.com/item/CMap/2334917?fr=aladdin

CMap
CMap is an MFC (Microsoft Foundation Classes) template collection class: a dictionary that maps unique keys to values.


Parameters
KEY: class of the object used as the key of the map.
ARG_KEY: data type used for KEY arguments, usually a reference to KEY.
VALUE: class of the object stored in the map.
ARG_VALUE: data type used for VALUE arguments, usually a reference to VALUE.

Explanation
CMap is a dictionary collection class that maps unique keys to values. Once a key-value pair (element) has been inserted into the map, the key can be used to retrieve or delete the pair efficiently. You can also iterate over all the elements in the map.
A variable of type POSITION provides alternate access to entries. A POSITION can be used to "remember" an entry and to iterate through the map. You might expect this iteration to proceed in key order, but it does not: the order in which elements are retrieved is indeterminate.
Some member functions of this class call global helper functions, which must be customized for most uses of the CMap class. See "Collection Class Helpers" in the "Macros and Globals" section of the "Microsoft Visual C++ MFC Library Reference".
CMap incorporates the IMPLEMENT_SERIAL macro to support serialization and dumping of its elements. When a map is stored to an archive, each element is serialized in turn, either with the overloaded insertion (<<) operator or with the Serialize member function. To get a diagnostic dump of the individual elements in the map, the depth of the dump context must be 1 or greater. When a CMap object is deleted, or when its elements are removed, both the keys and the values are removed. Deriving a map class is similar to deriving a list class.
See the "Collections" article in the online "Visual C++ Programmer's Guide" for more on deriving special-purpose list classes.
Header: #include <afxtempl.h>
Members of the CMap class
Constructor
CMap: constructs a collection that maps keys to values.
Operations
Lookup: looks up the value mapped to the specified key.
SetAt: inserts an element into the map; if a matching key is found, the existing element is replaced.
operator[]: inserts an element into the map; an operator substitute for SetAt.
RemoveKey: removes the element specified by a key.
RemoveAll: removes all elements from the map.
GetStartPosition: returns the position of the first element.
GetNextAssoc: gets the next element while iterating.
GetHashTableSize: returns the size of the hash table (number of elements).
InitHashTable: initializes the hash table and specifies its size.
Status
GetCount: returns the number of elements in the map.
IsEmpty: tests whether the map is empty (has no elements).

Usage of CMap
Ultimately, CMap stores its data as CPair objects of the form {KEY, VALUE}. CMap therefore stores KEY, not ARG_KEY. However, the MFC source shows that almost all CMap member functions take parameters of type ARG_KEY and ARG_VALUE. Using KEY& as ARG_KEY is therefore usually correct, except when:
1. The key is a primitive type such as int or char. Passing by value is then no worse than passing by reference (and may even be faster).
2. The key (KEY) type is CString. In that case use LPCTSTR as ARG_KEY rather than CString&. The reason is explained below.

How do I use CMap with my own class ClassX? CMap is a hash map. A hash map requires every element to have a hash value, which is a function of its KEY, and uses that value as the index into the hash table. If several KEYs have the same hash value, they are stored together in a linked list. So the first thing you need to do is provide a hash function.
CMap calls the template function HashKey() to compute the hash value.
The default implementation and the specializations for LPCSTR and LPCWSTR are as follows:
// inside <afxtempl.h>
template<class ARG_KEY>
AFX_INLINE UINT AFXAPI HashKey(ARG_KEY key)
{
    // default identity hash - works for most primitive values
    return (DWORD)(((DWORD_PTR)key)>>4);
}
// inside <strcore.cpp>
// specialized implementation for LPCWSTR
#if _MSC_VER >= 1100
template<> UINT AFXAPI HashKey<LPCWSTR> (LPCWSTR key)
#else
UINT AFXAPI HashKey(LPCWSTR key)
#endif
{
    UINT nHash = 0;
    while (*key)
        nHash = (nHash<<5) + nHash + *key++;
    return nHash;
}
// specialized implementation for LPCSTR
#if _MSC_VER >= 1100
template<> UINT AFXAPI HashKey<LPCSTR> (LPCSTR key)
#else
UINT AFXAPI HashKey(LPCSTR key)
#endif
{
    UINT nHash = 0;
    while (*key)
        nHash = (nHash<<5) + nHash + *key++;
    return nHash;
}
As you can see, the default implementation "assumes" that KEY is a pointer and casts it to DWORD. This is why you get "error C2440: 'type cast' : cannot convert from 'ClassXXX' to 'DWORD_PTR'" when you do not provide a specialized HashKey() for your ClassX.
Also, MFC only specializes HashKey() for LPCSTR and LPCWSTR, not for CStringA and CStringW. So if you want to use CString as the key type of a CMap, declare it as CMap<CString, LPCTSTR, ……>.
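The string specializations of HashKey() above compute hash = hash * 33 + character, since (nHash<<5) + nHash is nHash * 33. A plain C sketch of the same recurrence follows. The names `hash_string` and `bucket_index` are invented for illustration, the cast to unsigned char is a choice made here (it only matters for bytes above 127), and reducing the hash modulo the table size is an assumption about how a bucket is chosen, not a quote from the MFC source.

```c
#include <assert.h>
#include <stddef.h>

/* Same recurrence as the string HashKey() shown above: h = h * 33 + c. */
unsigned int hash_string(const char *key)
{
    assert(key != NULL);
    unsigned int h = 0;
    while (*key)
        h = (h << 5) + h + (unsigned char)*key++;
    return h;
}

/* Map a hash value to a bucket; table_size must be nonzero.
   (Assumed scheme: hash modulo table size.) */
unsigned int bucket_index(unsigned int hash, unsigned int table_size)
{
    assert(table_size != 0);
    return hash % table_size;
}
```

For example, `hash_string("AB")` is 65 * 33 + 66 = 2211, and with a table of 17 buckets `bucket_index(2211, 17)` is 1.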

Now you know how CMap computes the hash value. Because several keys can share the same hash value, a matching hash is not enough: CMap has to walk the linked list for that hash value and compare each stored key with the requested one. For that comparison it calls CompareElements(), another template function.
// inside <afxtempl.h>
// note: when called from CMap,
// TYPE=KEY, ARG_TYPE=ARG_KEY
// and note pElement1 is TYPE*, not TYPE
template<class TYPE, class ARG_TYPE>
BOOL AFXAPI CompareElements(const TYPE* pElement1,
                            const ARG_TYPE* pElement2)
{
    ASSERT(AfxIsValidAddress(pElement1,
                             sizeof(TYPE), FALSE));
    ASSERT(AfxIsValidAddress(pElement2,
                             sizeof(ARG_TYPE), FALSE));
    // for CMap<CString, LPCTSTR...>
    // we are comparing CString == LPCTSTR
    return *pElement1 == *pElement2;
}
Therefore, if you want to use CMap for your own class ClassX, you must provide specialized implementations of HashKey() and CompareElements().
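To make the two roles concrete (a hash function picks the bucket, a comparison function picks the entry within the bucket's linked list), here is a minimal chained hash map in plain C. It is a sketch, not MFC code: `StrMap`, `strmap_set`, `strmap_get` and `strmap_clear` are hypothetical names, keys are borrowed C strings, and the table size is fixed at a small prime.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 17   /* a small prime, chosen for illustration */

typedef struct Node {
    const char *key;     /* borrowed pointer; the caller keeps the string alive */
    int value;
    struct Node *next;   /* next entry that hashed to the same bucket */
} Node;

typedef struct {
    Node *buckets[TABLE_SIZE];
} StrMap;

/* Plays the role of HashKey(): h = h * 33 + c. */
static unsigned int strmap_hash(const char *key)
{
    unsigned int h = 0;
    while (*key)
        h = (h << 5) + h + (unsigned char)*key++;
    return h;
}

/* Insert or replace. Returns 0 on success, -1 if allocation fails. */
int strmap_set(StrMap *m, const char *key, int value)
{
    assert(m != NULL && key != NULL);
    unsigned int i = strmap_hash(key) % TABLE_SIZE;
    for (Node *n = m->buckets[i]; n != NULL; n = n->next) {
        if (strcmp(n->key, key) == 0) {   /* plays the role of CompareElements() */
            n->value = value;
            return 0;
        }
    }
    Node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    n->key = key;
    n->value = value;
    n->next = m->buckets[i];
    m->buckets[i] = n;
    return 0;
}

/* Returns 1 and stores the value in *out if the key is present, else 0. */
int strmap_get(const StrMap *m, const char *key, int *out)
{
    assert(m != NULL && key != NULL && out != NULL);
    unsigned int i = strmap_hash(key) % TABLE_SIZE;
    for (const Node *n = m->buckets[i]; n != NULL; n = n->next) {
        if (strcmp(n->key, key) == 0) {
            *out = n->value;
            return 1;
        }
    }
    return 0;
}

/* Free every node and leave the map empty. */
void strmap_clear(StrMap *m)
{
    assert(m != NULL);
    for (int i = 0; i < TABLE_SIZE; i++) {
        while (m->buckets[i] != NULL) {
            Node *next = m->buckets[i]->next;
            free(m->buckets[i]);
            m->buckets[i] = next;
        }
    }
}
```

Because both the hash and the comparison look at the string contents, two different buffers holding the same text find the same entry. Hashing or comparing the pointer values instead would treat them as different keys.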

Example: CMap used for CString*
As an example, the following shows what you need to do before using CMap with CString*. Of course, the value of the string is used as the key (KEY), not the address held in the pointer.
template<>
UINT AFXAPI HashKey<CString*> (CString* key)
{
    return (NULL == key) ? 0 : HashKey((LPCTSTR)(*key));
}
// I don't know why, but CompareElements can't work with CString*
// have to define this
typedef CString* LPCString;
template<>
BOOL AFXAPI CompareElements<LPCString, LPCString>
    (const LPCString* pElement1, const LPCString* pElement2)
{
    if ( *pElement1 == *pElement2 ) {
        // true even if pE1==pE2==NULL
        return true;
    } else if ( NULL != *pElement1 && NULL != *pElement2 ) {
        // both are not NULL
        return **pElement1 == **pElement2;
    } else {
        // either one is NULL
        return false;
    }
}

The main function is as follows:
int _tmain(int argc, TCHAR* argv[], TCHAR* envp[])
{
    CMap<CString*, CString*, int, int> map;
    CString name1 = "Microsoft";
    CString name2 = "Microsoft";
    map[&name1] = 100;    // insert using the address of name1
    int x = map[&name2];  // look up using a different object with the same text
    printf("%s = %d\n", (LPCTSTR)name1, x);
    return 0;
}
--------- console output ---------
Microsoft = 100
Note that even if you do not provide specialized implementations of HashKey() and CompareElements(), the compiler will not report an error, but in this case the output is 0, which may not be what you want.

Summary of CMap
CMap is a hash map, whereas the STL map (std::map) is a tree map. Comparing the efficiency of the two is meaningless (it is like comparing apples and oranges!). But if you need to retrieve keys in sorted order, use the STL map.
The design of HashKey() is the key to efficiency. Provide a HashKey() that has a low collision rate (a collision is when different keys produce the same hash value) and is cheap to compute (unlike MD5). Note that, at least for some classes, this is not an easy task.
When using CMap (or the STL hash_map), pay attention to the size of the hash table. Quoting a note from MSDN: "The hash table size should be a prime number. To minimize collisions, the size should be roughly 20 percent larger than the largest anticipated data set." By default the hash table size of CMap is 17, which suits data with about 10 keys. InitHashTable(UINT nHashSize) changes the size of the hash table, and it may only be called before the first element is added. (Do not confuse this with the constructor CMap(UINT nBlockSize): nBlockSize controls how many CAssoc objects are allocated at a time, to speed up the creation of new nodes.)
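The sizing advice (a prime, roughly 20 percent above the expected number of keys) is easy to turn into arithmetic. The helper below is a hypothetical illustration in plain C, not part of MFC: `suggest_table_size` rounds 1.2 times the expected count up to the next prime using trial division.

```c
#include <assert.h>

/* Trial-division primality test; fine for table-sized numbers. */
static int is_prime(unsigned int n)
{
    if (n < 2)
        return 0;
    for (unsigned int d = 2; d <= n / d; d++)
        if (n % d == 0)
            return 0;
    return 1;
}

/* Smallest prime that is at least 1.2 * expected (rounded up). */
unsigned int suggest_table_size(unsigned int expected)
{
    assert(expected <= 1000000000u);   /* keep the arithmetic far from overflow */
    unsigned int n = expected + (expected + 4) / 5;   /* ceil(expected * 1.2) */
    while (!is_prime(n))
        n++;
    return n;
}
```

The result would then be passed to InitHashTable() before any element is added. For 100 expected keys the helper returns 127.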

[ Last edited by zzz19760225 on 2017-12-11 at 19:53 ]
Floor 63 Posted 2016-06-26 19:56 · China, Hainan, Haikou, China Telecom
Top 1000 Most Commonly Used Chinese Characters Frequency Ranking-----Author: Zhong Li
https://www.thn21.com/base/zi/17300.html


There are only a little over 3,000 commonly used Chinese characters. The national standard GB2312-80, "Chinese Character Set for Information Interchange - Basic Set", was drawn up according to frequency of use.

The level-1 character set contains the commonly used characters, 3,755 in total, and the level-2 character set contains the less commonly used characters, 3,008 in total. Together the two levels hold 6,763 Chinese characters.

The level-1 characters account for 99.7% of usage. That is, in every ten thousand characters of modern Chinese text, these characters appear more than 9,970 times and all other characters together appear fewer than 30 times. The 1,000 most common characters account for more than 90% of usage.

Top 5 most frequently used characters (sum of frequencies is 10%):

的 一 是 了 我

Top (6~17) most frequently used characters (sum of frequencies is 10%):

不 人 在 他 有 这 个 上 们 来 到 时

Top (18~42) most frequently used characters (sum of frequencies is 10%):

大 地 为 子 中 你 说 生 国 年 着 就 那 和 要 她 出 也 得 里 后 自 以 会

Top (43~79) most frequently used characters (sum of frequencies is 10%):

家 可 下 而 过 天 去 能 对 小 多 然 于 心 学 么 之 都 好 看 起 发 当 没 成 只 如 事 把 还 用 第 样 道 想 作 种 开 (The sum of the frequencies of these 36 Chinese characters is 10%)

Top (80~140) most frequently used characters (sum of frequencies is 10%):

美 总 从 无 情 己 面 最 女 但 现 前 些 所 同 日 手 又 行 意 动 方 期 它 头 经 长 儿 回 位 分 爱 老 因 很 给 名 法 间 斯 知 世 什 两 次 使 身 者 被 高 已 亲 其 进 此 话 常 与 活 正 感

Top 141-232 most frequently used characters (sum of frequencies of these 92 characters is 10%)

见 明 问 力 理 尔 点 文 几 定 本 公 特 做 外 孩 相 西 果 走 将 月 十 实 向 声 车 全 信 重 三 机 工 物 气 每 并 别 真 打 太 新 比 才 便 夫 再 书 部 水 像 眼 等 体 却 加 电 主 界 门 利 海 受 听 表 德 少 克 代 员 许 稜 先 口 由 死 安 写 性 马 光 白 或 住 难 望 教 命 花 结 乐 色

Top 233-380 most frequently used characters (148 characters, sum of frequencies is 10%)

更 拉 东 神 记 处 让 母 父 应 直 字 场 平 报 友 关 放 至 张 认 接 告 入 笑 内 英 军 候 民 岁 往 何 度 山 觉 路 带 万 男 边 风 解 叫 任 金 快 原 吃 妈 变 通 师 立 象 数 四 失 满 战 远 格 士 音 轻 目 条 呢 病 始 达 深 完 今 提 求 清 王 化 空 业 思 切 怎 非 找 片 罗 钱 紶 吗 语 元 喜 曾 离 飞 科 言 干 流 欢 约 各 即 指 合 反 题 必 该 论 交 终 林 请 医 晚 制 球 决 窢 传 画 保 读 运 及 则 房 早 院 量 苦 火 布 品 近 坐 产 答 星 精 视 五 连 司 巴

382-500 (5.43%)

奇 管 类 未 朋 且 婚 台 夜 青 北 队 久 乎 越 观 落 尽 形 影 红 爸 百 令 周 吧 识 步 希 亚 术 留 市 半 热 送 兴 造 谈 容 极 随 演 收 首 根 讲 整 式 取 照 办 强 石 古 华 諣 拿 计 您 装 似 足 双 妻 尼 转 诉 米 称 丽 客 南 领 节 衣 站 黑 刻 统 断 福 城 故 历 惊 脸 选 包 紧 争 另 建 维 绝 树 系 伤 示 愿 持 千 史 谁 准 联 妇 纪 基 买 志 静 阿 诗 独 复 痛 消 社 算

501-631

算 义 竟 确 酒 需 单 治 卡 幸 兰 念 举 仅 钟 怕 共 毛 句 息 功 官 待 究 跟 穿 室 易 游 程 号 居 考 突 皮 哪 费 倒 价 图 具 刚 脑 永 歌 响 商 礼 细 专 黄 块 脚 味 灵 改 据 般 破 引 食 仍 存 众 注 笔 甚 某 沉 血 备 习 校 默 务 土 微 娘 须 试 怀 料 调 广 蜖 苏 显 赛 查 密 议 底 列 富 梦 错 座 参 八 除 跑 亮 假 印 设 线 温 虽 掉 京 初 养 香 停 际 致 阳 纸 李 纳 验 助 激 够 严 证 帝 饭 忘 趣 支

632-1000

春 集 丈 木 研 班 普 导 顿 睡 展 跳 获 艺 六 波 察 群 皇 段 急 庭 创 区 奥 器 谢 弟 店 否 害 草 排 背 止 组 州 朝 封 睛 板 角 况 曲 馆 育 忙 质 河 续 哥 呼 若 推 境 遇 雨 标 姐 充 围 案 伦 护 冷 警 贝 著 雪 索 剧 啊 船 险 烟 依 斗 值 帮 汉 慢 佛 肯 闻 唱 沙 局 伯 族 低 玩 资 屋 击 速 顾 泪 洲 团 圣 旁 堂 兵 七 露 园 牛 哭 旅 街 劳 型 烈 姑 陈 莫 鱼 异 抱 宝 权 鲁 简 态 级 票 怪 寻 杀 律 胜 份 汽 右 洋 范 床 舞 秘 午 登 楼 贵 吸 责 例 追 较 职 属 渐 左 录 丝 牙 党 继 托 赶 章 智 冲 叶 胡 吉 卖 坚 喝 肉 遗 救 修 松 临 藏 担 戏 善 卫 药 悲 敢 靠 伊 村 戴 词 森 耳 差 短 祖 云 规 窗 散 迷 油 旧 适 乡 架 恩 投 弹 铁 博 雷 府 压 超 负 勒 杂 醒 洗 采 毫 嘴 毕 九 冰 既 状 乱 景 席 珍 童 顶 派 素 脱 农 疑 练 野 按 犯 拍 征 坏 骨 余 承 置 臓 彩 灯 巨 琴 免 环 姆 暗 换 技 翻 束 增 忍 餐 洛 塞 缺 忆 判 欧 层 付 阵 玛 批 岛 项 狗 休 懂 武 革 良 恶 恋 委 拥 娜 妙 探 呀 营 退 摇 弄 桌 熟 诺 宣 银 势 奖 宫 忽 套 康 供 优 课 鸟 喊 降 夏 困 刘 罪 亡 鞋 健 模 败 伴 守 挥 鲜 财 孤 枪 禁 恐 伙 杰 迹 妹 藸 遍 盖 副 坦 牌 江 顺 秋 萨 菜 划 授 归 浪 听 凡 预 奶 雄 升 碃 编 典 袋 莱 含 盛 济 蒙 棋 端 腿 招 释 介 烧 误

According to the sampling statistics of the State Press and Publication Administration, there are 560 most commonly used Chinese characters, 807 commonly used characters, and 1033 less commonly used characters. The total of the three is 2400, accounting for 99% of the characters used in general books and periodicals. Therefore, if primary school students know 2400 commonly used characters, they can read general books and periodicals.

[ Last edited by zzz19760225 on 2018-3-21 at 22:30 ]
Floor 64 Posted 2016-06-26 19:57 · China, Hainan, Haikou, China Telecom
Floor 65 Posted 2016-06-26 19:58 · China, Hainan, Haikou, China Telecom
Floor 66 Posted 2016-06-26 19:59 · China, Hainan, Haikou, China Telecom
Floor 67 Posted 2016-06-26 20:00 · China, Hainan, Haikou, China Telecom
Floor 68 Posted 2016-06-26 20:00 · China, Hainan, Haikou, China Telecom
Floor 69 Posted 2016-06-26 20:01 · China, Hainan, Haikou, China Telecom
Floor 70 Posted 2016-06-26 20:05 · China, Hainan, Haikou, China Telecom
Floor 71 Posted 2016-06-26 20:07 · China, Hainan, Haikou, China Telecom
Floor 72 Posted 2016-06-26 20:08 · China, Hainan, Haikou, China Telecom
Floor 73 Posted 2016-06-26 20:09 · China, Hainan, Haikou, China Telecom
Floor 74 Posted 2016-06-26 20:09 · China, Hainan, Haikou, China Telecom
Floor 75 Posted 2016-06-26 20:10 · China, Hainan, Haikou, China Telecom
Author: Yu Yang
Link: https://www.zhihu.com/question/23374078/answer/69732605
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.

Long, long ago, there was a group of people who decided to use 8 openable and closable transistors to combine into different states to represent everything in the world. They saw that the 8-switch state was good, so they called this "byte". Later, they made some machines that could handle these bytes. The machine started, and many states could be combined with bytes, and the states began to change. They saw that this was good, so they called this machine a "computer".

At first, computers were used only in the United States. An 8-bit byte can take 256 (2 to the 8th power) different states. They reserved the first 32 states, starting from 0, for special purposes: whenever a terminal or printer met one of these agreed bytes, it performed an agreed action. On 0x0A the terminal moved to a new line; on 0x07 the terminal beeped at people; on 0x1b the printer printed reversed characters, or the terminal displayed letters in color. They saw that this was good, so they called the byte states below 0x20 "control codes". They then assigned consecutive byte states to the space, punctuation marks, digits, and uppercase and lowercase letters, numbering on up to 127, so that computers could store English text in bytes. Everyone saw this and thought it was good, so everyone called this scheme the ANSI "ASCII" code (American Standard Code for Information Interchange). At that time, every computer in the world used the same ASCII scheme to store English text.

Later, just like building the Tower of Babel, computers began to be used all over the world, but many countries did not use English. Their alphabets had many that were not in ASCII. In order to be able to store their text in computers, they decided to use the empty positions after 127 to represent these new letters and symbols, and also added many horizontal lines, vertical lines, intersections, etc. needed for drawing tables, and continued to number up to the last state 255. The character set from 128 to 255 was called the "extended character set". Since then, the greedy humans have no new states to use. The United States may not have expected that people in the third world also wanted to use computers!

When the Chinese people got computers, there were no byte states left to represent Chinese characters, and there were more than 6,000 commonly used Chinese characters to store. But this did not stump the wise Chinese people. We simply discarded the strange symbols after 127 and stipulated: a byte below 128 keeps its original meaning, but two bytes above 127 in a row represent one Chinese character. The first byte (called the high byte) runs from 0xA1 to 0xF7, and the second byte (the low byte) runs from 0xA1 to 0xFE. In this way more than 7,000 characters can be encoded, the simplified Chinese characters among them. These codes also include mathematical symbols, Roman and Greek letters, Japanese kana, and so on. Even the digits, punctuation marks and letters already present in ASCII were all encoded again as two-byte codes: these are the so-called "full-width" characters, while the originals below 128 are called "half-width" characters. The Chinese people thought this was pretty good, so they called this Chinese character scheme "GB2312". GB2312 is a Chinese extension of ASCII.
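The rule in this paragraph (a byte below 128 stands alone, two high bytes in the stated ranges form one character) can be written as a short scanner. This is a sketch in C under exactly the byte ranges quoted above; the function name `gb2312_char_count` is invented for illustration, and bytes that fit neither case are simply counted one at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters (not bytes) in a NUL-terminated GB2312 byte string.
   A high byte 0xA1..0xF7 followed by a low byte 0xA1..0xFE is one character. */
size_t gb2312_char_count(const unsigned char *s)
{
    assert(s != NULL);
    size_t count = 0;
    while (*s) {
        if (s[0] >= 0xA1 && s[0] <= 0xF7 && s[1] >= 0xA1 && s[1] <= 0xFE)
            s += 2;   /* high byte + low byte: one two-byte character */
        else
            s += 1;   /* ASCII, or a byte this sketch does not recognize */
        count++;
    }
    return count;
}
```

For the four bytes 'A', 0xB0, 0xA1, 'B' the function reports 3 characters, whereas strlen() on the same buffer would report 4 bytes.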

At that time every country, like China, made its own encoding standard. As a result, nobody understood anyone else's encoding and nobody supported anyone else's encoding. Even the Chinese mainland and Taiwan, brother regions only 150 nautical miles apart and using the same language, adopted different DBCS (double-byte character set) encoding schemes. In those days, to display Chinese characters the Chinese people had to install a "Chinese character system" dedicated to handling the display and input of Chinese characters. A fortune-telling program written by some superstitious fellow in Taiwan, for example, could only be used after installing another system, the "Yitian Chinese character system", which supported BIG5 encoding. If the wrong character system was installed, the display turned into a mess! What to do? And what about the poorer peoples of the world who could not use computers yet, and their scripts? This truly was the Tower of Babel problem of computing!

Just then, the archangel Gabriel appeared in time - an international organization called ISO (International Organization for Standardization) decided to start solving this problem. The method they adopted was very simple: abandon all regional encoding schemes and start over to create an encoding that includes all cultures, all letters and symbols in the world! They planned to call it "Universal Multiple-Octet Coded Character Set", referred to as UCS, commonly known as "unicode".

By the time Unicode was being drawn up, computer memory had grown enormously and space was no longer a problem. So ISO stipulated outright that every character must be represented uniformly with two bytes, that is, 16 bits. For the "half-width" characters of ASCII, Unicode kept the original codes unchanged and only widened them from 8 bits to 16 bits, while the characters of all other cultures and languages were encoded afresh in one unified scheme. Since the "half-width" English symbols only need the low 8 bits, their high 8 bits are always 0. This grand scheme therefore takes twice the space when storing English text, half of it wasted.

At this point, programmers who came from the old society began to notice a strange phenomenon: their strlen function had become unreliable. A Chinese character was no longer equivalent to two characters, but to one! Yes, from Unicode onward, a half-width English letter and a full-width Chinese character are both simply "one character", and both are "two bytes". Note the difference between the terms "character" and "byte". A "byte" is an 8-bit unit of physical storage, while a "character" is a culture-dependent symbol. In Unicode, one character is two bytes. The era in which one Chinese character counted as two English characters is almost over.

Unicode is not perfect either, and two problems arise. The first: how is Unicode to be told apart from ASCII? How can a computer know that three bytes represent one symbol rather than three separate symbols? The second: English letters need only one byte each. If Unicode required every symbol to take three or four bytes, each English letter would necessarily be preceded by two or three zero bytes. That is a great waste of storage and would make text files two to three times larger, which is unacceptable.

For a long time Unicode could not be popularized, until the Internet appeared. To solve the problem of how Unicode should be transmitted over networks, a number of UTF (UCS Transformation Format) standards emerged. As the names suggest, UTF-8 works in units of 8 bits and UTF-16 works in units of 16 bits. UTF-8 is the most widely used implementation of Unicode on the Internet. It is an encoding designed for transmission, and it makes encoding borderless, so that the characters of every culture in the world can be displayed. The defining feature of UTF-8 is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, and the number of bytes depends on the symbol. A character in the ASCII range is represented by one byte, so the 1-byte ASCII encoding is retained as a subset. (Note that a common Chinese character takes 2 bytes in the 16-bit Unicode form described above and 3 bytes in UTF-8.) Unicode code points do not map to UTF-8 bytes directly; the conversion follows the rules below.

Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
-----------------------------------+------------------------------------
0000 0000-0000 007F                | 0xxxxxxx
0000 0080-0000 07FF                | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF                | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF                | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
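The table translates almost line for line into code: pick the row by range, then distribute the bits of the code point over the x positions. The following C function is a sketch of that idea; the name `utf8_encode` is chosen here for illustration, and rejecting the surrogate range U+D800 to U+DFFF is an extra check that the table itself does not show.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one code point following the table above.
   Writes 1 to 4 bytes into out and returns how many were written,
   or 0 if cp is above U+10FFFF or is a surrogate (U+D800..U+DFFF). */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

As a check against the table, the character 我 (U+6211) falls in the third row and encodes to the three bytes E6 88 91.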


Finally, a simple summary:
The Chinese people extended and adapted the ASCII encoding to produce GB2312, which can represent more than 6,000 commonly used Chinese characters.
There are too many Chinese characters, including traditional forms and all kinds of other characters, so the GBK encoding was produced. It contains everything in GB2312 and adds a great deal more.
China is a multi-ethnic country, and almost every ethnic group has its own writing system. To represent those scripts, GBK was extended further into the GB18030 encoding.
Every country, like China, encoded its own language, so all kinds of encodings appeared. Without the corresponding encoding installed, the content written in that encoding cannot be interpreted.
Finally, an organization called ISO could not stand it anymore and created one common encoding, UNICODE. This encoding is very large, large enough to hold every script and symbol in the world. So as long as a computer has the UNICODE encoding system, a file in any of the world's scripts that is saved in UNICODE can be interpreted correctly by other computers.
For network transmission, UNICODE has two common encoding forms, UTF-8 and UTF-16, which work in units of 8 bits and 16 bits respectively. Someone will then ask: since UTF-8 can store so many scripts and symbols, why do so many people in China still use encodings such as GBK? Because Chinese text takes more space in UTF-8: a Chinese character usually occupies 3 bytes in UTF-8 but 2 bytes in GBK. If the vast majority of the target users are Chinese, using an encoding such as GBK is also acceptable.


https://www.zhihu.com/question/23374078
https://wenku.baidu.com/view/cb9fe505cc17552707220865.html

[ Last edited by zzz19760225 on 2018-2-16 at 18:10 ]