Re namejm & others:
Regarding yuanyong630's encryption scheme: your line of thinking is correct, with only a slight deviation.
When the Notepad program saves a newly created document without a specified encoding type, it uses the default ANSI type (for the Chinese version, this corresponds to GB encoding).
When opening an existing document, Notepad analyzes its encoding. First, it checks for a BOM (Byte Order Mark, 2 to 3 bytes long) at the start of the document. If one is present, its bytes determine the encoding: FF FE (Unicode, i.e. UTF-16 little endian), FE FF (Unicode big endian), EF BB BF (UTF-8).
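The BOM check described above can be sketched in a few lines of Python. This is only an illustration of the byte comparison, not Notepad's actual code; the function name detect_bom and the table BOMS are made up for this example, and the encoding names follow Notepad's own labels.

```python
import codecs

# BOM signatures listed in the post, labelled with Notepad's encoding names.
# None of these three prefixes overlaps another, so the order is not significant.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),                   # EF BB BF
    (codecs.BOM_UTF16_LE, "Unicode"),             # FF FE
    (codecs.BOM_UTF16_BE, "Unicode big endian"),  # FE FF
]

def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(detect_bom(b"\xff\xfeh\x00i\x00"))  # Unicode
print(detect_bom(b"plain text"))          # None
```

When detect_bom returns None, the file is one of the BOM-less cases where the encoding has to be guessed from the content.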
Since many non-ANSI documents are in fact "plain text" without any BOM, a document without one cannot simply be assumed to be ANSI. Instead, a series of statistical heuristics is used to guess the encoding from the content. Notepad uses the IsTextUnicode function to decide whether a file is Unicode / Unicode big endian, and IsTextUTF8 to decide whether it is UTF-8.
However, statistical heuristics inevitably misjudge some inputs, especially very short documents, where the small sample makes errors far more likely. The well-known joke that Microsoft has a grudge against China Unicom comes from exactly this: when Notepad opens an ANSI document containing only the two characters "联通", IsTextUTF8 misidentifies it as UTF-8. IsTextUnicode makes the same kind of mistake: a document such as "this app can break", with its 4-3-3-5 word-length pattern, is misidentified as Unicode.
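To see why these two inputs are ambiguous, here is a rough Python sketch. looks_like_utf8 is a hypothetical, deliberately naive structural check written for this example; it is not the real IsTextUTF8, whose exact rules are not given in this post. The second half simply reinterprets the ASCII bytes as 16-bit little-endian units.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Naive check: each non-ASCII lead byte must be followed by the right number
    of 10xxxxxx continuation bytes. Overlong forms are NOT rejected."""
    i, saw_multibyte = 0, False
    while i < len(data):
        b = data[i]
        if b < 0x80:              # plain ASCII byte
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:     # 110xxxxx: lead byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF:   # 1110xxxx: lead byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF7:   # 11110xxx: lead byte of a 4-byte sequence
            need = 3
        else:
            return False          # stray continuation or invalid lead byte
        tail = data[i + 1 : i + 1 + need]
        if len(tail) != need or any((t & 0xC0) != 0x80 for t in tail):
            return False
        saw_multibyte = True
        i += 1 + need
    return saw_multibyte

gbk_bytes = "联通".encode("gbk")
print(gbk_bytes.hex(" "))          # c1 aa cd a8
print(looks_like_utf8(gbk_bytes))  # True: both pairs fit the 110xxxxx 10xxxxxx shape

ascii_bytes = b"this app can break"
as_utf16 = ascii_bytes.decode("utf-16-le")
# 18 ASCII bytes become 9 code units, all inside the CJK Unified Ideographs block
print(len(as_utf16), all(0x4E00 <= ord(c) <= 0x9FFF for c in as_utf16))  # 9 True
```

A strict UTF-8 decoder would reject the C1 lead byte as an overlong form, which is one reason a more careful detector can avoid the "联通" misjudgment.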
Note that such misjudgments depend on the text being short and its byte pattern being undisturbed. If the text above is modified even slightly (adding a single line break is enough), the misjudgment is very unlikely to recur.
What makes yuanyong630's scheme special is that its byte string not only looks like Unicode but is also long, at 1288 bytes. Its Unicode signature is therefore strong enough to withstand interference from a short run of non-Unicode-looking bytes; that is simply how the statistics work out. Once the interfering run gets somewhat longer, though, the Unicode signature is significantly weakened, until IsTextUnicode finally classifies the file as non-Unicode. So for those who can never get the test to succeed, the cause is most likely the length and content of the batch code they appended. Everyone can test the code in .
Other editors (such as Word / Wordpad / EditPlus / UltraEdit) use newer detection algorithms, so their Unicode detection is much improved, although their UTF-8 detection is still unsatisfactory. Since no algorithm can be completely accurate even in theory, the only reliable options are to avoid BOM-less non-ANSI documents or to specify the encoding manually when opening a document.
Also, if Notepad saves a file whose encoding it has misjudged, the file becomes hard to recover. If it is saved with the misjudged encoding, a BOM is added to the original document, so other editors can no longer show the original content either. If it is saved as ANSI, the original document is treated as Unicode and converted, and the chance of restoring it is close to zero.
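The two save scenarios can be imitated in Python. This is a sketch under assumptions: the sample bytes are invented for the example, and Python's gbk codec with errors="replace" only stands in for whatever conversion Notepad performs on a Chinese system, so the exact bytes Notepad writes may differ.

```python
raw = b"echo off\r\n"  # 10 bytes of plain ANSI batch code (made-up sample)

# The misjudgment: the bytes are read as 16-bit little-endian code units.
misread = raw.decode("utf-16-le")

# Case A: saving as "Unicode" keeps the original bytes but prepends FF FE,
# so BOM-aware editors will now show the misread characters as well.
saved_unicode = b"\xff\xfe" + misread.encode("utf-16-le")

# Case B: saving as ANSI converts each misread character to GBK.
# Characters with no GBK equivalent are replaced by "?" and are lost.
saved_ansi = misread.encode("gbk", errors="replace")

# Attempt to undo case B by reversing the conversion.
recovered = saved_ansi.decode("gbk").encode("utf-16-le")

print(saved_unicode[2:] == raw)  # True: case A is repairable by stripping the BOM
print(recovered == raw)          # False: the CR LF pair became "?" in case B
```

In this sample only the line break is lost, because the other byte pairs happen to map to characters that GBK contains; with arbitrary content far more code units fall outside GBK, which matches the near-zero chance of recovery described above.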
Unicode Introduction
http://my.opera.com/neutronstar/blog/index.dml/tag/编码
Why Does Microsoft Have a Grudge Against China Unicom
http://blog.vckbase.com/localvar/archive/2005/07/12/9510.aspx
Notepad bug? Encoding issue?
http://weblogs.asp.net/cumpsd/archive/2004/02/27/81098.aspx
Bush Hid The Facts
http://www.shoutwire.com/comments/16341/Bush_Hid_The_Facts
cry.cmd
for /l %%a in (1,1,10) do ren *.jpg %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a %%a
@echo off
echo bbs.cn-dos.net
echo.