中国DOS联盟

-- 联合DOS 推动DOS 发展DOS --

联盟域名：www.cn-dos.net 论坛域名：www.cn-dos.net/forum
DOS，代表着自由开放与发展，我们努力起来，学习FreeDOS和Linux的自由开放与GNU精神，共同创造和发展美好的自由与GNU GPL世界吧！

游客: 注册 | 登录 | 命令行 | 搜索 | 上传 | 帮助 »

中国DOS联盟论坛 » DOS启动盘 & LOGO技术（启动盘室） » [转帖] 关于IO.SYS的内嵌LOGO的格式

English/Chinese Fix Translation

Wengier
系统支持

“新DOS时代”站长

积分 27736
发帖 10521
注册 2002-10-9
状态离线

『楼主』: [转帖] 关于IO.SYS的内嵌LOGO的格式使用 LLM 解释/回答一下

关于MS-DOS 7.x的内嵌LOGO，我刚才在网上找到了以下资料：

Windows 95的封面没有独立文件形式的位图文件,这与WIN.COM对LOGOS.SYS的处理和USER.EXE对LOGOW.SYS的处理不同。笔者在剖析引导文件IO.SYS的过程中发现,Windows 95的启动封面徽标是嵌入在引导文件IO.SYS中的,并经DBLSPACE压缩过,图像数据区长度为64KB。在笔者剖析的Windows 95版本中,图像数据占据IO.SYS(文件长度223748B)217～344扇区,数据内部有多处"DS"标识,这是DBLSPACE压缩文件的标志,由于DBLSPACE是分段校验压缩文件,因此,即使此区内有一个字节的改动也会造成图像的大幅破坏。在上述IO.SYS中,图像显示执行代码部分起始于以"DBLSBIN$\LOGO.SYS"标志的地方并占据110-112扇区。由于DBLSPACE的压缩文件很复杂并分段校验,使我们不能像对Windows3.1的WIN.COM 那样通过改动组合文件或重编WIN.COM的部分代码来定做启动封面,但可以在MSDOS.SYS中以Logo=0来消隐显示。关于配置文件MSDOS.SYS的设定已有文献可查,本文附录2简述了其配置设定选择。
实际上,启动封面也是可以定做的。笔者在分析IO.SYS的过程中发现,IO.SYS在显示内部嵌入封面前先试图打开一个在引导目录下名为LOGO.SYS的文件(利用DOS功能调用INT 21H,AH=3DH子功能),若打开失败(此文件不存在)则转显示内部嵌入封面(在WIN.COM及USER.EXE中打开文件失败时则不显示图形,而代之以文本显示有关信息),若打开成功则做文件格式检查,其要求的格式比WIN.COM对LOGOS.SYS的检查条件苛刻。

IO.SYS对LOGO.SYS检查的代码如下:

debug io.sys
-u de0e
12B9:DE0E 813C424D CMP WORD PTR ,4D42
12B9:DE12 0F DB 0F
12B9:DE13 854801 TEST CX,
12B9:DE16 83C60E ADD SI,+0E
-u de19
12B9:DE19 833C28 CMP WORD PTR ,+28
-u de20
12B9:DE20 837C0C01 CMP WORD PTR ,+01
-u de28
12B9:DE28 837C0E08 CMP WORD PTR ,+08
-u de30
12B9:DE30 817C044001CMP WORD PTR ,0140
-u de39
12B9:DE39 817C089001CMP WORD PTR ,0190
-u de42
12B9:DE42 837C1000 CMP WORD PTR ,+00

从以上代码我们可以立即看出,所打开的文件是一个非压缩、幅度320×400、位面数为1、256色的位图文件。因此,可以用Paintbrush等工具形成一个256色非压缩、320像素×400像素的位图文件并命名为LOGO.SYS,将其?在引导目录下即可。要求的LOGO.SYS格式恰好与LOGOS.SYS及LOGOW.SYS的格式一致。作为验证,可将LOGOS.SYS或LOGOW.SYS命名为LOGO.SYS并放在引导目录下,实验证明完全可行。如果引导目录下有名为LOGO.SYS的文件,但没有通过上述所有检查,则拒绝显示并且也不再显示内部嵌入的徽标封面。

四、撤销对LOGO.SYS的格式检查并形成抖动变色的徽标封面
用上述方法显示的外部封面图像是静止的,而IO.SYS中的图像下面具有一个滚动的颜色条,如?的方法则可以使图像颜色变化抖动?，如果去掉引导目录下的 LOGO.SYS,则IO.SYS内嵌的图像也可以做到整幅图面"彩云流动",具有很强的动感。位图的抖动变色是由IO.SYS处理的,其有关处理标志嵌入到上面对LOGO.SYS 的判断语句内,因此将相关的语句作一下改动即可。方法为:用PCTOOLS或其它工具(DEBUG等)找到下面有下划线的部分,将其均改写为16进制机器代码90(nop不作任何操作的空指令),在首尾之间共60字节,其中含有一些代码在上文的反汇编中并未列出。

debug io.sys
-d de00
12B9:DE00 00 93 BA 02 00 E8 D6 02 -0F 82 52 01 8B F2 81 3C ..........R....
12B9:DE10 42 4D 0F 85 48 01 83 C6 -0E 83 3C 28 0F 85 3E 01 BM..H.....
12B9:DE20 83 7C 0C 01 0F 85 36 01 -83 7C 0E 08 0F 85 2E 01 .|....6..|.....
12B9:DE30 81 7C 04 40 01 0F 85 25 -01 81 7C 08 90 01 0F 85 .|.@...%..|....
12B9:DE40 1C 01 83 7C 10 00 0F 85 -14 01 8B 44 24 1E 2E 8E ...|.......D$..
12B9:DE50 1E 3E 0F A2 D8 02 F6 D8 -04 FF A2 D9 02 84 E4 74 .>.............
12B9:DE60 06 A3 DA 02 A3 DC 02 1F -2E C6 06 F2 8E 00 16 07 ...............
12B9:DE70 83 EC 26 8B FC BD 5F 03 -E8 45 02 B9 00 80 E8 E9 ..&..._..E.....

对IO.SYS作上述改动后,可以显示任何位图,但由于显示是根据系统显示驱动程序及屏幕调整图形大小,因此尺度上仍以原设定为好。若想做改动应先测试,结果可能会失真,但不影响显示及运行。经过上述改动后,在没有外部LOGO.SYS的情况下,内嵌的封面可以抖动和变色。若将LOGOS.SYS或LOGOW.SYS命名为LOGO.SYS并放在引导目录下即可代替内部封面而且具有变化的色彩。但若对这两个位图进行了编辑或是用Paintbrush形成的新文件一般不能变色,这是由于Paintbrush形成的文件实际用到的色彩很少,位图颜色表大部分是空的。要形成一个新的变色徽标应该:
1.用Paintbrush等形成一个普通256色位图;
2.使形成的位图具有完全的256颜色索引表,这可以从LOGOW.SYS或LOGOS.SYS 的颜色索引表中得到,即复制LOGOW.SYS或LOGOS.SYS文件偏移36H～436H的域到新文件相同的域。注意不要改动位图前0～36H字节。将新文件命名为LOGO.SYS并放在引导目录下即可。

五、撤销WIN.COM及USER.EXE对LOGOS.SYS 和LOGOW.SYS的格式检查
方法与前面类似,此处不再一一详述。
由于引导文件非常重要,在做改动时一定要在拷贝上改动,现在DOS7用IFSHLP.SYS可以处理长达255B的文件名,而此文件名的存储是将目录项属性字节改为0FH,即系统(04)+隐藏(02)+只读(01)+卷标(08)=VFAT的文件名属性(0FH),并利用多个目录项区将长文件名连续存放。为了防止数据丢失,DOS7屏蔽了绝对磁盘写INT 26H, 而PCTOOLS等工具的EDIT功能是调用INT 26H进行写盘的的,若在硬盘上改动会造成系统死锁而取消写盘,在软盘上则没有这个问题,因此建议对IO.SYS的修改在系统软磁盘上进行。
对DOS7的系统软盘形成很容易,在DOS7下初始化的磁盘,只将IO.SYS、MSDOS.SYS、COM-MAND.COM拷进即可引导Windows 95,这是由于DOS7有更精巧的 BOOT引导区,引导文件IO.SYS可以不连续存放、不占起始簇、文件名项不是第一目录项。

附录: 位图文件结构

本文涉及到的位图文件首部重要域
偏移长度(Bytes) 标识信息
00H 2 424DH 即 "BM"
0EH 4 位图信息头大小 28H=40字节
12H 4 位图宽度像素数 4001H=320个像素
16H 4 位图高度像素数 9001H=400个像素
1AH 2 位图目标设备位面数 1
1CH 2 位图阵列每像素所需位数,可取值为
1:单色,4:16色,8:256色,24:16兆色
1EH 4 位图压缩标志,可取值:0:未压缩,1:行程压缩8位位图,2:4位压缩位图

因此本文IO.SYS所读的位图LOGO.SYS是非压缩、位面数为1、 256色、宽320像素、高400像素的位图。

Regarding the embedded LOGO in MS-DOS 7.x, I just found the following information online:

The cover of Windows 95 does not have a bitmap file in an independent file form, which is different from the processing of LOGOS.SYS by WIN.COM and the processing of LOGOW.SYS by USER.EXE. The author discovered during the analysis of the boot file IO.SYS that the startup cover logo of Windows 95 is embedded in the boot file IO.SYS and compressed by DBLSPACE, with an image data area length of 64KB. In the version of Windows 95 analyzed by the author, the image data occupies sectors 217-344 of IO.SYS (file length 223748B), and there are multiple "DS" identifiers inside the data, which are signs of DBLSPACE compressed files. Since DBLSPACE is a segmented checksum compressed file, even a change of one byte in this area will cause a large-scale damage to the image. In the above IO.SYS, the part of the image display execution code starts at the place marked with "DBLSBIN$\LOGO.SYS" and occupies sectors 110-112. Because the compressed file of DBLSPACE is very complex and segmented checked, we cannot customize the startup cover like modifying the combined file or recompiling part of the code of WIN.COM for Windows 3.1, but we can blank the display by setting Logo=0 in MSDOS.SYS. The settings of the configuration file MSDOS.SYS are already available in relevant documents, and Appendix 2 of this article briefly describes the configuration setting options.

In fact, the startup cover can also be customized. The author found during the analysis of IO.SYS that before IO.SYS displays the embedded cover internally, it tries to open a file named LOGO.SYS in the boot directory (using the DOS function call INT 21H, AH=3DH sub-function). If the opening fails (this file does not exist), it switches to display the embedded cover internally (when the opening fails in WIN.COM and USER.EXE, it does not display the graphic but replaces it with text to display relevant information). If the opening is successful, it does a file format check, and the required format is more stringent than the check conditions of WIN.COM for LOGOS.SYS.

The code for IO.SYS to check LOGO.SYS is as follows:

debug io.sys
-u de0e
12B9:DE0E 813C424D CMP WORD PTR ,4D42
12B9:DE12 0F DB 0F
12B9:DE13 854801 TEST CX,
12B9:DE16 83C60E ADD SI,+0E
-u de19
12B9:DE19 833C28 CMP WORD PTR ,+28
-u de20
12B9:DE20 837C0C01 CMP WORD PTR ,+01
-u de28
12B9:DE28 837C0E08 CMP WORD PTR ,+08
-u de30
12B9:DE30 817C044001CMP WORD PTR ,0140
-u de39
12B9:DE39 817C089001CMP WORD PTR ,0190
-u de42
12B9:DE42 837C1000 CMP WORD PTR ,+00

From the above code, we can immediately see that the opened file is a non-compressed bitmap file with a size of 320×400, 1 plane, and 256 colors. Therefore, a 256-color non-compressed bitmap file with a size of 320 pixels × 400 pixels can be formed using tools like Paintbrush, named LOGO.SYS, and placed in the boot directory. The required format of LOGO.SYS happens to be consistent with the formats of LOGOS.SYS and LOGOW.SYS. As a verification, renaming LOGOS.SYS or LOGOW.SYS to LOGO.SYS and placing it in the boot directory is fully feasible. If there is a file named LOGO.SYS in the boot directory but it does not pass all the above checks, it will be refused to be displayed and the embedded logo cover will not be displayed either.

IV. Removing the format check for LOGO.SYS and forming a dither-colored logo cover

The external cover image displayed by the above method is static, and there is a scrolling color bar below the image in IO.SYS. The method above can make the image colors change dithered. If the LOGO.SYS in the boot directory is removed, the embedded image in IO.SYS can also achieve "clouds flowing" over the entire picture, with strong movement. The dither coloring of the bitmap is processed by IO.SYS, and the relevant processing flags are embedded in the above judgment statement for LOGO.SYS. Therefore, modifying the relevant statements can be done. The method is: use PCTOOLS or other tools (DEBUG, etc.) to find the parts underlined below, and rewrite them all to the hexadecimal machine code 90 (nop, an empty instruction that does nothing). There are 60 bytes between the beginning and the end, and some code is not listed in the above disassembly.

debug io.sys
-d de00
12B9:DE00 00 93 BA 02 00 E8 D6 02 -0F 82 52 01 8B F2 81 3C ..........R....
12B9:DE10 42 4D 0F 85 48 01 83 C6 -0E 83 3C 28 0F 85 3E 01 BM..H.....
12B9:DE20 83 7C 0C 01 0F 85 36 01 -83 7C 0E 08 0F 85 2E 01 .|....6..|.....
12B9:DE30 81 7C 04 40 01 0F 85 25 -01 81 7C 08 90 01 0F 85 .|.@...%..|....
12B9:DE40 1C 01 83 7C 10 00 0F 85 -14 01 8B 44 24 1E 2E 8E ...|.......D$..
12B9:DE50 1E 3E 0F A2 D8 02 F6 D8 -04 FF A2 D9 02 84 E4 74 .>.............
12B9:DE60 06 A3 DA 02 A3 DC 02 1F -2E C6 06 F2 8E 00 16 07 ...............
12B9:DE70 83 EC 26 8B FC BD 5F 03 -E8 45 02 B9 00 80 E8 E9 ..&..._..E.....

After making the above changes to IO.SYS, any bitmap can be displayed. But since the display is adjusted according to the system display driver and the screen, it is better to keep the original settings in terms of scale. If changes are to be made, testing should be done first, and the result may be distorted, but it does not affect the display and operation. After the above changes, the embedded cover can dither and change colors when there is no external LOGO.SYS. If LOGOS.SYS or LOGOW.SYS is renamed to LOGO.SYS and placed in the boot directory, it can replace the internal cover and have changing colors. However, if these two bitmaps are edited or a new file is formed using Paintbrush, it generally cannot change colors, because the actual colors used in the file formed by Paintbrush are very few, and most of the bitmap color tables are empty. To form a new dither-colored logo, the following should be done:
1. Form a common 256-color bitmap using Paintbrush, etc.;
2. Make the formed bitmap have a complete 256-color index table, which can be obtained from the color index table of LOGOW.SYS or LOGOS.SYS, that is, copy the area from offset 36H to 436H of LOGOW.SYS or LOGOS.SYS to the same area of the new file. Note that do not change the first 0-36H bytes of the bitmap. Name the new file LOGO.SYS and place it in the boot directory.

V. Removing the format check for LOGOS.SYS and LOGOW.SYS by WIN.COM and USER.EXE

The method is similar to the above, and details are not described one by one here.

Since the boot file is very important, changes must be made on the copy. Now, DOS7 uses IFSHLP.SYS to handle file names up to 255B, and the storage of this file name is to change the directory entry attribute byte to 0FH, that is, the VFAT file name attribute (0FH) which is system (04) + hidden (02) + read-only (01) + volume label (08) = 0FH. And use multiple directory entry areas to store the long file name continuously. To prevent data loss, DOS7 shields the absolute disk write INT 26H. The EDIT function of PCTOOLS and other tools calls INT 26H to write to the disk. If changes are made on the hard disk, it will cause the system to deadlock and cancel the disk writing. There is no this problem on the floppy disk. Therefore, it is recommended to modify IO.SYS on the system floppy disk.

It is very easy to form the system floppy disk of DOS7. The disk initialized under DOS7 only needs to copy IO.SYS, MSDOS.SYS, and COMMAND.COM to boot Windows 95, because DOS7 has a more sophisticated BOOT boot area. The boot file IO.SYS can be stored discontinuously, does not occupy the starting cluster, and the file name entry is not the first directory entry.

Appendix: Bitmap file structure

Important fields in the bitmap file header involved in this article
Offset Length(Bytes) Identification information
00H 2 424DH i.e. "BM"
0EH 4 Bitmap information header size 28H=40 bytes
12H 4 Bitmap width in pixels 4001H=320 pixels
16H 4 Bitmap height in pixels 9001H=400 pixels
1AH 2 Number of planes of the bitmap target device 1
1CH 2 Bits per pixel of the bitmap array, can take values:
1: monochrome, 4: 16-color, 8: 256-color, 24: 16 million colors
1EH 4 Bitmap compression flag, can take values: 0: uncompressed, 1: run-length compressed 8-bit bitmap, 2: 4-bit compressed bitmap

Therefore, the bitmap LOGO.SYS read by IO.SYS in this article is a non-compressed, 1-plane, 256-color, 320-pixel wide, 400-pixel high bitmap.

Wengier - 新DOS时代

欢迎大家来到我的“新DOS时代”网站，里面有各类DOS软件和资料，地址：
http://wendos.mycool.net/

E-Mail & MSN: wengierwu AT hotmail.com （最近比较忙，有事请联系DOSroot和雨露，谢谢！）

2003-4-18 00:00

查看资料发送邮件访问主页发短消息网志 OICQ

(29206679)

编辑帖子回复引用回复

Wengier
系统支持

“新DOS时代”站长

积分 27736
发帖 10521
注册 2002-10-9
状态离线

『第 2 楼』: 使用 LLM 解释/回答一下

我刚才成功地将Win95中内嵌的启动LOGO替换了MS-DOS 7.10的IO.SYS中的Win98启动LOGO后，用软盘、硬盘启动均一切正常，并且MS-DOS 7.10启动时显示Win95的启动LOGO了。

因此我想，只要将我们想要的BMP格式的LOGO用DBLSPACE压缩一下再嵌入IO.SYS中应该可以了。只是DBLSPACE使用起来非常麻烦，而且我没有条件来实现。Roy觉得呢？能进行实践一下吗？

2003-4-18 00:00

查看资料发送邮件访问主页发短消息网志 OICQ

(29206679)

编辑帖子回复引用回复

Roy
管理员

專業島民

积分 4869
发帖 1633
注册 2002-12-10
状态离线

『第 3 楼』: 使用 LLM 解释/回答一下

對不起啊,我不懂用dblspace.

我的網站：http://mw16.2ya.com/ 我的網誌: http://scrappedblog.blogspot.com/
~
我的Winamp正在播放的歌曲:

2003-4-19 00:00

查看资料发短消息网志

编辑帖子回复引用回复

Wengier
系统支持

“新DOS时代”站长

积分 27736
发帖 10521
注册 2002-10-9
状态离线

『第 4 楼』: 使用 LLM 解释/回答一下

我要是有一张没用的硬盘就可以来测试用DBLSPACE了。

2003-4-19 00:00

查看资料发送邮件访问主页发短消息网志 OICQ

(29206679)

编辑帖子回复引用回复

Wengier
系统支持

“新DOS时代”站长

积分 27736
发帖 10521
注册 2002-10-9
状态离线

『第 5 楼』: 使用 LLM 解释/回答一下

大家谁会用DBLSPACE的，帮帮忙好吗？

2003-4-19 00:00

查看资料发送邮件访问主页发短消息网志 OICQ

(29206679)

编辑帖子回复引用回复

ko20010214
版主

积分 7294
发帖 1628
注册 2002-10-16
状态离线

『第 6 楼』: 使用 LLM 解释/回答一下

词典编码

有许多场合，开始时不知道要编码数据的统计特性，也不一定允许你事先知道它们的统计特性。因此，人们提出了许许多多的数据压缩方法，企图用来对这些数据进行压缩编码，在实际编码过程中以尽可能获得最大的压缩比。这些技术统称为通用编码技术。词典编码(Dictionary Encoding)技术就是属于这一类，这种技术属于无损压缩技术。

4.4.1 词典编码的思想
词典编码(dictionary encoding)的根据是数据本身包含有重复代码这个特性。例如文本文件和光栅图像就具有这种特性。词典编码法的种类很多，归纳起来大致有两类。
第一类词典法的想法是企图查找正在压缩的字符序列是否在以前输入的数据中出现过，然后用已经出现过的字符串替代重复的部分，它的输出仅仅是指向早期出现过的字符串的“指针”。这种编码概念如图4-06所示。

图4-06 第一类词典法编码概念
这里所指的“词典”是指用以前处理过的数据来表示编码过程中遇到的重复部分。这类编码中的所有算法都是以Abraham Lempel和Jakob Ziv在1977年开发和发表的称为LZ77算法为基础的，例如1982年由Storer和Szymanski改进的称为LZSS算法就是属于这种情况。
第二类算法的想法是企图从输入的数据中创建一个“短语词典(dictionary of the phrases)”，这种短语不一定是像“严谨勤奋求实创新”和“国泰民安是坐稳总统宝座的根本”这类具有具体含义的短语，它可以是任意字符的组合。编码数据过程中当遇到已经在词典中出现的“短语”时，编码器就输出这个词典中的短语的“索引号”，而不是短语本身。这个概念如图4-07所示。

图4-07 第二类词典法编码概念
J.Ziv和A.Lempel在1978年首次发表了介绍这种编码方法的文章。在他们的研究基础上，Terry A.Weltch在1984年发表了改进这种编码算法的文章，因此把这种编码方法称为LZW(Lempel-Ziv Walch)压缩编码，首先在高速硬盘控制器上应用了这种算法。

4.4.2 LZ77算法
为了更好地说明LZ77算法的原理，首先介绍算法中用到的几个术语：
输入数据流(input stream)：要被压缩的字符序列。
字符(character)：输入数据流中的基本单元。
编码位置(coding position)：输入数据流中当前要编码的字符位置，指前向缓冲存储器中的开始字符。
前向缓冲存储器(Lookahead buffer)：存放从编码位置到输入数据流结束的字符序列的存储器。
窗口(window)：指包含W个字符的窗口，字符是从编码位置开始向后数也就是最后处理的字符数。
指针(pointer)：指向窗口中的匹配串且含长度的指针。
LZ77编码算法的核心是查找从前向缓冲存储器开始的最长的匹配串。编码算法的具体执行步骤如下：
把编码位置设置到输入数据流的开始位置。
查找窗口中最长的匹配串。
以“(Pointer, Length) Characters”的格式输出，其中Pointer是指向窗口中匹配串的指针，Length表示匹配字符的长度，Characters是前向缓冲存储器中的不匹配的第1个字符。
如果前向缓冲存储器不是空的，则把编码位置和窗口向前移(Length+1)个字符，然后返回到步骤2。
[例4.4] 待编码的数据流如表4-09所示，编码过程如表4-10所示。现作如下说明：
“步骤”栏表示编码步骤。
“位置”栏表示编码位置，输入数据流中的第1个字符为编码位置1。
“匹配串”栏表示窗口中找到的最长的匹配串。
“字符”栏表示匹配之后在前向缓冲存储器中的第1个字符。
“输出”栏以“(Back_chars, Chars_length) Explicit_character”格式输出。其中，(Back_chars, Chars_length)是指向匹配串的指针，告诉译码器“在这个窗口中向后退Back_chars个字符然后拷贝Chars_length个字符到输出”，Explicit_character是真实字符。例如，表4-10中的输出“(5,2) C”告诉译码器回退5个字符，然后拷贝2个字符“AB”

表4-09待编码的数据流
位置
1
2
3
4
5
6
7
8
9
9
字符
A
A
B
C
B
B
A
B
C

表4-10 编码过程
步骤
位置
匹配串
字符
输出
1
1
--
A
(0,0) A
2
2
2
A
B
(1,1) B
3
4
--
C
(0,0) C
4
5
B
B
(2,1) B
5
7
A B
C
(5,2) C
　
4.4.3 LZSS算法
LZ77通过输出真实字符解决了在窗口中出现没有匹配串的问题，但这个解决方案包含有冗余信息。冗余信息表现在两个方面，一是空指针，二是编码器可能输出额外的字符，这种字符是指可能包含在下一个匹配串中的字符。LZSS算法以比较有效的方法解决这个问题，它的思想是如果匹配串的长度比指针本身的长度长就输出指针，否则就输出真实字符。由于输出的压缩数据流中包含有指针和字符本身，为了区分它们就需要有额外的标志位，即ID位。
LZSS编码算法的具体执行步骤如下：
把编码位置置于输入数据流的开始位置。
在前向缓冲存储器中查找与窗口中最长的匹配串
① Pointer ：=匹配串指针。
② Length ：=匹配串长度。
判断匹配串长度Length是否大于等于最小匹配串长度(Length 3 MIN_LENGTH) ，
如果“是”：输出指针，然后把编码位置向前移动Length个字符。
如果“否”：输出前向缓冲存储器中的第1个字符，然后把编码位置向前移动一个字符。
如果前向缓冲存储器不是空的，就返回到步骤2。
[例4.5] 编码字符串如表4-11所示，编码过程如表4-12所示。现说明如下：
“步骤”栏表示编码步骤。
“位置”栏表示编码位置，输入数据流中的第1个字符为编码位置1。
“匹配”栏表示窗口中找到的最长的匹配串。
“字符”栏表示匹配之后在前向缓冲存储器中的第1个字符。
“输出”栏的输出为：
① 如果匹配串本身的长度Length 3 MIN_LENGTH，输出指向匹配串的指针，格式为(Back_chars, Chars_length)。该指针告诉译码器“在这个窗口中向后退Back_chars个字符然后拷贝Chars_length个字符到输出”。
② 如果匹配串本身的长度Length ￡ MIN_LENGTH，则输出真实的匹配串。

表4-11 输入数据流
位置
1
2
3
4
5
6
7
8
9
10
11
字符
A
A
B
B
C
C
B
B
A
A
B
C

表4-12 编码过程(MIN_LENGTH = 2)
步骤
位置
匹配串
输出
1
1
--
A
2
2
A
A
3
3
--
--
B
4
4
B
B
5
5
--
C
6
6
B B
(3,2)
7
8
A A B
(7,3)
8
11
C
C

在相同的计算机环境下，LZSS算法比LZ77可获得比较高的压缩比，而译码同样简单。这也就是为什么这种算法成为开发新算法的基础，许多后来开发的文档压缩程序都使用了LZSS的思想。例如，PKZip, ARJ, LHArc和ZOO等等，其差别仅仅是指针的长短和窗口的大小等有所不同。
LZSS同样可以和熵编码联合使用，例如ARJ就与霍夫曼编码联用，而PKZip则与Shannon-Fano联用，它的后续版本也采用霍夫曼编码。

4.4.4 LZ78算法
--------------------------------------------------------------------------------
在介绍LZ78算法之前，首先说明在算法中用到的几个术语和符号：
字符流(Charstream)：要被编码的数据序列。
字符(Character)：字符流中的基本数据单元。
前缀(Prefix)：在一个字符之前的字符序列。
缀-符串(String)：前缀＋字符。
码字(Code word)：码字流中的基本数据单元，代表词典中的一串字符。
码字流(Codestream)：码字和字符组成的序列，是编码器的输出。
词典(Dictionary)：缀-符串表。按照词典中的索引号对每条缀-符串(String)指定一个
码字(Code word)。
当前前缀(Current prefix)：在编码算法中使用，指当前正在处理的前缀，用符号P表示。
当前字符(Current character)：在编码算法中使用，指当前前缀之后的字符，用符号C表示。
当前码字(Current code word)：在译码算法中使用，指当前处理的码字，用W表示当前码字，String.W表示当前码字的缀-符串。

1. 编码算法
LZ78的编码思想是不断地从字符流中提取新的缀-符串(String)，通俗地理解为新“词条”，然后用“代号”也就是码字(Code word)表示这个“词条”。这样一来，对字符流的编码就变成了用码字(Code word)去替换字符流(Charstream)，生成码字流(Codestream)，从而达到压缩数据的目的。在编码开始时词典是空的，不包含任何缀-符串(string)。在这种情况下编码器就输出一个表示空字符串的特殊码字(例如“0”)和字符流中(Charstream)的第一个字符C，并把这个字符C添加到词典中作为一个由一个字符组成的缀-符串(string)。在编码过程中，如果出现类似的情况，也照此办理。在词典中已经包含某些缀-符串(String)之后，如果“当前前缀P +当前字符C”已经在词典中，就用字符C来扩展这个前缀，这样的扩展操作一直重复到获得一个在词典中没有的缀-符串(String)为止。此时就输出表示当前前缀P的码字(Code word)和字符C，并把P+C添加到词典中作为前缀(Prefix)，然后开始处理字符流(Charstream)中的下一个前缀。
LZ78编码器的输出是码字-字符(W,C)对，每次输出一对到码字流中，与码字W相对应的缀-符串(String)用字符C进行扩展生成新的缀-符串(String)，然后添加到词典中。LZ78编码的具体算法如下：
步骤1：在开始时，词典和当前前缀P都是空的。
步骤2：当前字符C ：=字符流中的下一个字符。
步骤3：判断P+C是否在词典中：
(1) 如果“是”：用C扩展P，让P ：= P+C ；
(2) 如果“否”：
① 输出与当前前缀P相对应的码字和当前字符C；
② 把字符串P+C 添加到词典中。
③ 令P ：=空值。
(3) 判断字符流中是否还有字符需要编码
① 如果“是”：返回到步骤2。
② 如果“否”：若当前前缀P不是空的，输出相应于当前前缀P的码字，然后结束编码。

2. 译码算法
在译码开始时译码词典是空的，它将在译码过程中从码字流中重构。每当从码字流中读入一对码字-字符(W,C)对时，码字就参考已经在词典中的缀-符串，然后把当前码字的缀-符串string.W 和字符C输出到字符流(Charstream)，而把当前缀-符串(string.W+C)添加到词典中。在译码结束之后，重构的词典与编码时生成的词典完全相同。LZ78译码的具体算法如下：
步骤1：在开始时词典是空的。
步骤2：当前码字W ：=码字流中的下一个码字。
步骤3：当前字符C ：= 紧随码字之后的字符。
步骤4：把当前码字的缀-符串(string.W)输出到字符流(Charstream)，然后输出字符C。
步骤5：把string.W+C添加到词典中。
步骤6：判断码字流中是否还有码字要译
(1) 如果“是”，就返回到步骤2。
(2) 如果“否”，则结束。
[例4.6] 编码字符串如表4-13所示，编码过程如表4-14所示。现说明如下：
“步骤”栏表示编码步骤。
“位置”栏表示在输入数据中的当前位置。
“词典”栏表示添加到词典中的缀-符串，缀-符串的索引等于“步骤”序号。
“输出”栏以(当前码字W, 当前字符C)简化为(W, C)的形式输出。

表4-13 编码字符串
位置
1
2
3
4
5
6
7
8
9
字符
A
B
B
C
B
B
C
A
B
A
表4-14 编码过程
步骤
位置
词典
输出
1
1
A
(0,A)
2
2
B
(0,B)
3
3
B C
(2,C)
4
4
5
B C A
(3,A)
5
8
B A
(2,A)

与LZ77相比，LZ78的最大优点是在每个编码步骤中减少了缀-符串(String)比较的数目，而压缩率与LZ77类似。

4.4.5 LZW算法
在LZW算法中使用的术语与LZ78使用的相同，仅增加了一个术语—前缀根(Root)，它是由单个字符串组成的缀-符串(String)。在编码原理上，LZW与LZ78相比有如下差别：①LZW只输出代表词典中的缀-符串(String)的码字(code word)。这就意味在开始时词典不能是空的，它必须包含可能在字符流出现中的所有单个字符，即前缀根(Root)。②由于所有可能出现的单个字符都事先包含在词典中，每个编码步骤开始时都使用一字符前缀(one-character prefix)，因此在词典中搜索的第1个缀-符串有两个字符。现将LZW编码算法和译码算法介绍如下。

1. 编码算法
LZW编码是围绕称为词典的转换表来完成的。这张转换表用来存放称为前缀(Prefix)的字符序列，并且为每个表项分配一个码字(Code word)，或者叫做序号，如表4-15所示。这张转换表实际上是把8位ASCII字符集进行扩充，增加的符号用来表示在文本或图像中出现的可变长度ASCII字符串。扩充后的代码可用9位、10位、11位、12位甚至更多的位来表示。Welch的论文中用了12位，12位可以有4096个不同的12位代码，这就是说，转换表有4096个表项，其中256个表项用来存放已定义的字符，剩下3840个表项用来存放前缀(Prefix)。

表4-15 词典
码字(Code word)
前缀(Prefix)
1
　
…
…
193
A
194
B
…
…
255
　
…
…
…
1305
abcdefxyF01234
…
…

LZW编码器(软件编码器或硬件编码器)就是通过管理这个词典完成输入与输出之间的转换。LZW编码器的输入是字符流(Charstream)，字符流可以是用8位ASCII字符组成的字符串，而输出是用n位(例如12位)表示的码字流(Codestream)，码字代表单个字符或多个字符组成的字符串。LZW编码器使用了一种很实用的分析(parsing)算法，称为贪婪分析算法(greedy parsing algorithm)。在贪婪分析算法中，每一次分析都要串行地检查来自字符流(Charstream)的字符串，从中分解出已经识别的最长的字符串，也就是已经在词典中出现的最长的前缀(Prefix)。用已知的前缀(Prefix)加上下一个输入字符C也就是当前字符(Current character)作为该前缀的扩展字符，形成新的扩展字符串——缀-符串(String)：Prefix.C。这个新的缀-符串(String)是否要加到词典中，还要看词典中是否存有和它相同的缀-符串String。如果有，那么这个缀-符串(String)就变成前缀(Prefix)，继续输入新的字符，否则就把这个缀-符串(String)写到词典中生成一个新的前缀(Prefix)，并给一个代码。

LZW编码算法的具体执行步骤如下：
步骤1：开始时的词典包含所有可能的根(Root)，而当前前缀P是空的；
步骤2：当前字符(C) ：=字符流中的下一个字符；
步骤3：判断缀-符串P+C是否在词典中
(1) 如果“是”：P ：= P+C // (用C扩展P) ；
(2) 如果“否”
① 把代表当前前缀P的码字输出到码字流;
② 把缀-符串P+C添加到词典;
③ 令P ：= C //(现在的P仅包含一个字符C);
步骤4：判断码字流中是否还有码字要译
(1) 如果“是”，就返回到步骤2；
(2) 如果“否”
① 把代表当前前缀P的码字输出到码字流;
② 结束。
LZW编码算法可用伪码表示。开始时假设编码词典包含若干个已经定义的单个码字。例如，256个字符的码字，用伪码可以表示成：
--------------------------------------------------------------------
Dictionary[j] ← all n single-character， j＝1, 2， …，n
j ← n+1
Prefix ← read first Character in Charstream
while((C ← next Character)!=NULL)
Begin
If Prefix.C is in Dictionary
Prefix ← Prefix.C
else
Codestream ← cW for Prefix
Dictionary[j] ← Prefix.C
j ← n+1
Prefix ← C
end
Codestream ← cW for Prefix
--------------------------------------------------------------------

2. 译码算法
LZW译码算法中还用到另外两个术语：①当前码字(Current code word)：指当前正在处理的码字，用cW表示，用string.cW表示当前缀-符串；②先前码字(Previous code word)：指先于当前码字的码字，用pW表示，用string.pW表示先前缀-符串。
LZW译码算法开始时，译码词典与编码词典相同，它包含所有可能的前缀根(roots)。LZW算法在译码过程中会记住先前码字(pW)，从码字流中读当前码字(cW)之后输出当前缀-符串string.cW，然后把用string.cW的第一个字符扩展的先前缀-符串string.pW添加到词典中。
LZW译码算法的具体执行步骤如下：
步骤1：在开始译码时词典包含所有可能的前缀根(Root)。
步骤2： cW ：=码字流中的第一个码字。
步骤3：输出当前缀-符串string.cW到码字流。
步骤4：先前码字pW ：= 当前码字cW。
步骤5：当前码字cW ：= 码字流中的下一个码字。
步骤6：判断先前缀-符串string.pW是否在词典中
(1) 如果“是”，则：
① 把先前缀-符串string.pW输出到字符流。
② 当前前缀P ：=先前缀-符串string.pW。
③ 当前字符C ：=当前前缀-符串string.cW的第一个字符。
④ 把缀-符串P+C添加到词典。
(2) 如果“否”，则：
① 当前前缀P ：=先前缀-符串string.pW。
② 当前字符C ：=当前缀-符串string.cW的第一个字符。
③ 输出缀-符串P+C到字符流，然后把它添加到词典中。
步骤7：判断码字流中是否还有码字要译
(1) 如果“是”，就返回到步骤4。
(2) 如果“否”, 结束。
LZW译码算法可用伪码表示如下:
----------------------------------------------------------------
Dictionary[j] ← all n single-character， j＝1, 2， …，n
j ← n+1
cW ← first code from Codestream
Charstream ← Dictionary[cW]
pW ← cW
While((cW ← next Code word)!=NULL)
Begin
If cW is in Dictionary
Charstream ← Dictionary[cW]
Prefix ← Dictionary[pW]
cW ← first Character of Dictionary[cW]
Dictionary[j] ← Prefix.cW
j ← n+1
pW ← cW
else
Prefix ← Dictionary[pW]
cW ← first Character of Prefix
Charstream ← Prefix.cW
Dictionary[j] ← Prefix.C
pW ← cW
j ← n+1
end
----------------------------------------------------------------

[例4.7] 编码字符串如表4-16所示，编码过程如表4-17所示。现说明如下：
“步骤”栏表示编码步骤；
“位置”栏表示在输入数据中的当前位置；
“词典”栏表示添加到词典中的缀-符串，它的索引在括号中；
“输出”栏表示码字输出。
表4-16 被编码的字符串
位置
1
1
2
3
4
5
6
7
8
9
字符
A
B
B
A
B
A
B
A
C
表4-17 LZW的编码过程
步骤
位置
词典
词典
输出
　　 (1)
A
　
　　 (2)
B
　
　　 (3)
C
　
1
1
(4)
A B
(1)
2
2
(5)
B B
(2)
3
3
3
(6)
B A
(2)
4
4
(7)
A B A
(4)
5
6
(8)
A B A C
(7)
6
--
--
--
(3)
表4-18解释了译码过程。每个译码步骤译码器读一个码字，输出相应的缀-符串，并把它添加到词典中。例如，在步骤4中，先前码字(2)存储在先前码字(pW)中，当前码字(cW)是(4)，当前缀-符串string.cW是输出(“A B”)，先前缀-符串string.pW ("B")是用当前缀-符串string.cW ("A")的第一个字符，其结果("B A") 添加到词典中，它的索引号是(6)

表4-18 LZW的译码过程
步骤
代码
词典
输出
　　 (1)
A
　
　　 (2)
B
　
　　 (3)
C
　
1
(1)
--
--
A
2
(2)
(2)
(4)
A B
B
3
(2)
(5)
B B
B
4
(4)
(6)
B A
A B
5
(7)
(7)
A B A
A B A
6
(3)
(8)
A B A C
C

LZW算法得到普遍采用，它的速度比使用LZ77算法的速度快，因为它不需要执行那么多的缀-符串比较操作。对LZW算法进一步的改进是增加可变的码字长度，以及在词典中删除老的缀-符串。在GIF图像格式和UNIX的压缩程序中已经采用了这些改进措施之后的LZW算法。LZW算法取得了专利，专利权的所有者是美国的一个大型计算机公司—Unisys(优利系统公司)，除了商业软件生产公司之外，可以免费使用LZW算法。

Dictionary Encoding

There are many occasions where the statistical characteristics of the data to be encoded are not known at the beginning, and it may not be allowed for you to know their statistical characteristics in advance. Therefore, people have proposed many data compression methods, trying to compress and encode these data, and to obtain the maximum compression ratio as much as possible in the actual encoding process. These technologies are collectively referred to as universal encoding technologies. Dictionary Encoding technology belongs to this category, and this technology belongs to lossless compression technology.

4.4.1 Idea of Dictionary Encoding
The basis of dictionary encoding is that the data itself contains the characteristic of repeated codes. For example, text files and raster images have this characteristic. There are many types of dictionary encoding methods, which can be roughly classified into two categories.
The idea of the first type of dictionary method is to attempt to find whether the character sequence being compressed has appeared in the previously input data, and then use the already appeared string to replace the repeated part. The output of this method is only a "pointer" pointing to the string that appeared earlier. The encoding concept of this type is shown in Figure 4-06.

Figure 4-06 Coding Concept of the First Type of Dictionary Method
The "dictionary" referred to here means using the previously processed data to represent the repeated parts encountered in the encoding process. All algorithms in this type of encoding are based on the LZ77 algorithm developed and published by Abraham Lempel and Jakob Ziv in 1977. For example, the LZSS algorithm improved by Storer and Szymanski in 1982 belongs to this situation.
The idea of the second type of algorithm is to attempt to create a "dictionary of the phrases" from the input data. This phrase does not necessarily have specific meanings like "rigorous, diligent, practical, innovative" and "national peace and people's happiness is the foundation for sitting firmly in the presidential seat", but it can be any combination of characters. When encountering a "phrase" that has already appeared in the dictionary during the process of encoding data, the encoder outputs the "index number" of the phrase in the dictionary instead of the phrase itself. The concept of this type is shown in Figure 4-07.

Figure 4-07 Coding Concept of the Second Type of Dictionary Method
J. Ziv and A. Lempel first published an article introducing this encoding method in 1978. On the basis of their research, Terry A. Weltch published an article to improve this encoding algorithm in 1984. Therefore, this encoding method is called LZW (Lempel-Ziv Walch) compression encoding, and this algorithm was first applied in high-speed hard disk controllers.

4.4.2 LZ77 Algorithm
To better illustrate the principle of the LZ77 algorithm, first introduce several terms used in the algorithm:
Input stream: The character sequence to be compressed.
Character: The basic unit in the input stream.
Coding position: The current character position in the input stream to be encoded, referring to the starting character in the lookahead buffer.
Lookahead buffer: The memory that stores the character sequence from the coding position to the end of the input stream.
Window: Refers to a window containing W characters, and the characters are counted backward from the coding position, that is, the number of characters processed last.
Pointer: A pointer pointing to the matching string in the window and containing the length.
The core of the LZ77 encoding algorithm is to find the longest matching string starting from the lookahead buffer. The specific execution steps of the encoding algorithm are as follows:
Set the coding position to the starting position of the input stream.
Find the longest matching string in the window.
Output in the format of "(Pointer, Length) Characters", where Pointer is the pointer pointing to the matching string in the window, Length represents the length of the matching characters, and Characters is the first unmatched character in the lookahead buffer.
If the lookahead buffer is not empty, move the coding position and the window forward by (Length + 1) characters, and then return to step 2.
The data stream to be encoded is shown in Table 4-09, and the encoding process is shown in Table 4-10. The following explanations are given:
The "Step" column indicates the encoding step.
The "Position" column indicates the coding position, and the first character in the input stream is coding position 1.
The "Matching string" column indicates the longest matching string found in the window.
The "Character" column indicates the first character in the lookahead buffer after matching.
The "Output" column is output in the format of "(Back_chars, Chars_length) Explicit_character". Among them, (Back_chars, Chars_length) is a pointer pointing to the matching string, telling the decoder "go back Back_chars characters in this window and then copy Chars_length characters to the output", and Explicit_character is the real character. For example, the output "(5,2) C" in Table 4-10 tells the decoder to go back 5 characters and then copy 2 characters "AB".

Table 4-09 Data Stream to be Encoded
Position
1
2
3
4
5
6
7
8
9
Character
A
A
B
C
B
B
A
B
C

Table 4-10 Encoding Process
Step
Position
Matching string
Character
Output
1
1
--
A
(0,0) A
2
2
2
A
B
(1,1) B
3
4
--
C
(0,0) C
4
5
B
B
(2,1) B
5
7
A B
C
(5,2) C

4.4.3 LZSS Algorithm
LZ77 solves the problem of no matching string in the window by outputting real characters, but this solution contains redundant information. The redundant information is manifested in two aspects: one is the empty pointer, and the other is that the encoder may output additional characters, which refer to the characters that may be included in the next matching string. The LZSS algorithm solves this problem in a more effective way. Its idea is to output the pointer if the length of the matching string is longer than the length of the pointer itself, otherwise output the real character. Since the output compressed data stream contains the pointer and the character itself, an additional flag bit, that is, the ID bit, is needed to distinguish them.
The specific execution steps of the LZSS encoding algorithm are as follows:
Place the coding position at the starting position of the input stream.
Find the longest matching string in the lookahead buffer with the window
① Pointer := matching string pointer.
② Length := matching string length.
Judge whether the matching string length Length is greater than or equal to the minimum matching string length (Length ≥ MIN_LENGTH).
If "yes": output the pointer, and then move the coding position forward by Length characters.
If "no": output the first character in the lookahead buffer, and then move the coding position forward by one character.
If the lookahead buffer is not empty, return to step 2.
The encoded string is shown in Table 4-11, and the encoding process is shown in Table 4-12. The following explanations are given:
The "Step" column indicates the encoding step.
The "Position" column indicates the coding position, and the first character in the input stream is coding position 1.
The "Matching" column indicates the longest matching string found in the window.
The "Character" column indicates the first character in the lookahead buffer after matching.
The output in the "Output" column is:
① If the length of the matching string itself Length ≥ MIN_LENGTH, output the pointer pointing to the matching string, in the format of (Back_chars, Chars_length). This pointer tells the decoder "go back Back_chars characters in this window and then copy Chars_length characters to the output".
② If the length of the matching string itself Length ≤ MIN_LENGTH, output the real matching string.

Table 4-11 Input Data Stream
Position
1
2
3
4
5
6
7
8
9
10
11
Character
A
A
B
B
C
C
B
B
A
A
B
C

Table 4-12 Encoding Process (MIN_LENGTH = 2)
Step
Position
Matching string
Output
1
1
--
A
2
2
A
A
3
3
--
--
B
4
4
B
B
5
5
--
C
6
6
B B
(3,2)
7
8
A A B
(7,3)
8
11
C
C

Under the same computer environment, the LZSS algorithm can obtain a relatively high compression ratio than LZ77, and the decoding is also simple. This is why this algorithm has become the basis for developing new algorithms. Many later-developed document compression programs use the idea of LZSS. For example, PKZip, ARJ, LHArc, and ZOO, etc. The difference is only in the length of the pointer and the size of the window, etc.
LZSS can also be used in combination with entropy encoding. For example, ARJ is used in combination with Huffman coding, and PKZip is used in combination with Shannon-Fano, and its subsequent versions also use Huffman coding.

4.4.4 LZ78 Algorithm
--------------------------------------------------------------------------------
Before introducing the LZ78 algorithm, first explain several terms and symbols used in the algorithm:
Charstream: The data sequence to be encoded.
Character: The basic data unit in the charstream.
Prefix: The character sequence before a character.
String: Prefix + character.
Code word: The basic data unit in the codestream, representing a string of characters in the dictionary.
Codestream: The sequence composed of code words and characters, which is the output of the encoder.
Dictionary: String table. Each string is assigned a code word according to the index number in the dictionary.
Current prefix (Current prefix): Used in the encoding algorithm, referring to the current prefix being processed, denoted by P.
Current character (Current character): Used in the encoding algorithm, referring to the character after the current prefix, denoted by C.
Current code word (Current code word): Used in the decoding algorithm, referring to the current code word being processed, denoted by W, and String.W refers to the string of the current code word.

1. Encoding Algorithm
The encoding idea of LZ78 is to continuously extract new strings from the charstream, which is commonly understood as new "entries", and then use the "code" that is the code word to represent this "entry". In this way, the encoding of the charstream becomes replacing the charstream with code words to generate a codestream, thereby achieving the purpose of compressing data. At the beginning of encoding, the dictionary is empty and does not contain any strings. In this case, the encoder outputs a special code word (such as "0") representing an empty string and the first character C in the charstream, and adds this character C to the dictionary as a string composed of one character. In the encoding process, if a similar situation occurs, it is handled in the same way. After the dictionary already contains some strings, if "current prefix P + current character C" is already in the dictionary, the prefix is expanded with the character C, and this expansion operation is repeated until a string that is not in the dictionary is obtained. At this time, the code word corresponding to the current prefix P and the character C are output, and P + C is added to the dictionary as a prefix, and then the next prefix in the charstream is processed.
The output of the LZ78 encoder is a code word-character (W, C) pair. Each time a pair is output to the codestream, the string corresponding to the code word W is expanded with the character C to generate a new string, which is then added to the dictionary. The specific algorithm of LZ78 encoding is as follows:
Step 1: At the beginning, the dictionary and the current prefix P are both empty.
Step 2: Current character C := next character in the charstream.
Step 3: Judge whether P + C is in the dictionary:
(1) If "yes": expand P with C, let P := P + C;
(2) If "no":
① Output the code word corresponding to the current prefix P and the current character C;
② Add the string P + C to the dictionary.
③ Let P := empty value.
(3) Judge whether there are still characters in the charstream to be encoded
① If "yes": return to step 2.
② If "no": if the current prefix P is not empty, output the code word corresponding to the current prefix P, and then end the encoding.

2. Decoding Algorithm
At the beginning of decoding, the decoding dictionary is empty, and it will be reconstructed from the codestream during the decoding process. Whenever a code word-character (W, C) pair is read from the codestream, the code word refers to the string already in the dictionary, then the string of the current code word string.W and the character C are output to the charstream, and the current string (string.W + C) is added to the dictionary. After decoding is completed, the reconstructed dictionary is exactly the same as the dictionary generated during encoding. The specific algorithm of LZ78 decoding is as follows:
Step 1: At the beginning, the dictionary is empty.
Step 2: Current code word W := next code word in the codestream.
Step 3: Current character C := the character following the code word.
Step 4: Output the string of the current code word (string.W) to the charstream, and then output the character C.
Step 5: Add string.W + C to the dictionary.
Step 6: Judge whether there are still code words in the codestream to be decoded
(1) If "yes", return to step 2.
(2) If "no", end.
The encoded string is shown in Table 4-13, and the encoding process is shown in Table 4-14. The following explanations are given:
The "Step" column indicates the encoding step.
The "Position" column indicates the current position in the input data.
The "Dictionary" column indicates the string added to the dictionary, and the index of the string is equal to the "Step" number.
The "Output" column is output in the form of (current code word W, current character C) simplified to (W, C).

Table 4-13 Encoded String
Position
1
2
3
4
5
6
7
8
9
Character
A
B
B
C
B
B
C
A
B
A

Table 4-14 Encoding Process
Step
Position
Dictionary
Output
1
1
A
(0,A)
2
2
B
(0,B)
3
3
B C
(2,C)
4
4
5
B C A
(3,A)
5
8
B A
(2,A)

Compared with LZ77, the biggest advantage of LZ78 is that the number of string comparisons is reduced in each encoding step, and the compression ratio is similar to that of LZ77.

4.4.5 LZW Algorithm
The terms used in the LZW algorithm are the same as those used in LZ78, and one more term - prefix root (Root) is added, which is a string composed of a single character. In terms of encoding principle, there are the following differences between LZW and LZ78: ① LZW only outputs the code word representing the string in the dictionary. This means that at the beginning, the dictionary cannot be empty, and it must contain all single characters that may appear in the charstream, that is, the prefix root (Root). ② Since all possible single characters are included in the dictionary in advance, a one-character prefix is used at the beginning of each encoding step. Therefore, the first string searched in the dictionary has two characters. The LZW encoding algorithm and decoding algorithm are introduced as follows.

1. Encoding Algorithm
LZW encoding is completed around a conversion table called a dictionary. This table is used to store the character sequence called the prefix, and a code word, or called a serial number, is assigned to each table item, as shown in Table 4-15. This conversion table actually expands the 8-bit ASCII character set, and the added symbols are used to represent variable-length ASCII strings that appear in text or images. The expanded code can be represented by 9 bits, 10 bits, 11 bits, 12 bits, or even more bits. Welch's paper uses 12 bits. 12 bits can have 4096 different 12-bit codes, which means that the conversion table has 4096 table items, among which 256 table items are used to store defined characters, and the remaining 3840 table items are used to store prefixes.

Table 4-15 Dictionary
Code word
Prefix
1
…
…
193
A
194
B
…
…
255
…
…
…
1305
abcdefxyF01234
…
…

The LZW encoder (software encoder or hardware encoder) completes the conversion between input and output by managing this dictionary. The input of the LZW encoder is a charstream, which can be a string composed of 8-bit ASCII characters, and the output is a codestream represented by n bits (for example, 12 bits). The code word represents a single character or a string composed of multiple characters. The LZW encoder uses a very practical parsing algorithm called the greedy parsing algorithm. In the greedy parsing algorithm, each parsing has to serially check the string from the charstream, and decompose the longest recognized string, that is, the longest prefix that has appeared in the dictionary. Use the known prefix (Prefix) plus the next input character C, that is, the current character (Current character), as the extended character of this prefix to form a new extended string - string: Prefix.C. Whether this new string should be added to the dictionary also depends on whether there is the same string in the dictionary. If yes, then this string becomes the prefix (Prefix), and continue to input new characters. Otherwise, this string is written to the dictionary to generate a new prefix (Prefix) and a code is given.
The specific execution steps of the LZW encoding algorithm are as follows:
Step 1: At the beginning, the dictionary contains all possible roots (Roots), and the current prefix P is empty;
Step 2: Current character (C) := next character in the charstream;
Step 3: Judge whether the string P + C is in the dictionary
(1) If "yes": P := P + C // (expand P with C);
(2) If "no"
① Output the code word representing the current prefix P to the codestream;
② Add the string P + C to the dictionary;
③ Let P := C // (now P only contains one character C);
Step 4: Judge whether there are still code words in the codestream to be decoded
(1) If "yes", return to step 2;
(2) If "no"
① Output the code word representing the current prefix P to the codestream;
② End.
The LZW encoding algorithm can be represented by pseudo-code. At the beginning, it is assumed that the encoding dictionary contains several defined single code words. For example, 256 character code words, which can be represented by pseudo-code as:
--------------------------------------------------------------------
Dictionary ← all n single-character, j = 1, 2, …, n
j ← n+1
Prefix ← read first Character in Charstream
while((C ← next Character)!=NULL)
Begin
If Prefix.C is in Dictionary
Prefix ← Prefix.C
else
Codestream ← cW for Prefix
Dictionary ← Prefix.C
j ← n+1
Prefix ← C
end
Codestream ← cW for Prefix
--------------------------------------------------------------------

2. Decoding Algorithm
The LZW decoding algorithm also uses two other terms: ① Current code word (Current code word): refers to the current code word being processed, denoted by cW, and string.cW refers to the current string; ② Previous code word (Previous code word): refers to the code word prior to the current code word, denoted by pW, and string.pW refers to the previous string.
At the beginning of LZW decoding, the decoding dictionary is the same as the encoding dictionary, and it contains all possible prefix roots (roots). The LZW algorithm remembers the previous code word (pW) during the decoding process. After reading the current code word (cW) from the codestream, the current string string.cW is output, and then the previous string string.pW expanded by the first character of string.cW is added to the dictionary.
The specific execution steps of the LZW decoding algorithm are as follows:
Step 1: At the beginning of decoding, the dictionary contains all possible prefix roots (Roots).
Step 2: cW := the first code word in the codestream.
Step 3: Output the current string string.cW to the codestream.
Step 4: Previous code word pW := current code word cW.
Step 5: Current code word cW := next code word in the codestream.
Step 6: Judge whether the previous string string.pW is in the dictionary
(1) If "yes", then:
① Output the previous string string.pW to the charstream.
② Current prefix P := previous string string.pW.
③ Current character C := the first character of the current string string.cW.
④ Add the string P + C to the dictionary.
(2) If "no", then:
① Current prefix P := previous string string.pW.
② Current character C := the first character of the current string string.cW.
③ Output the string P + C to the charstream, and then add it to the dictionary.
Step 7: Judge whether there are still code words in the codestream to be decoded
(1) If "yes", return to step 4.
(2) If "no", end.
The LZW decoding algorithm can be represented by pseudo-code as follows:
----------------------------------------------------------------
Dictionary ← all n single-character, j = 1, 2, …, n
j ← n+1
cW ← first code from Codestream
Charstream ← Dictionary
pW ← cW
While((cW ← next Code word)!=NULL)
Begin
If cW is in Dictionary
Charstream ← Dictionary
Prefix ← Dictionary
cW ← first Character of Dictionary
Dictionary ← Prefix.cW
j ← n+1
pW ← cW
else
Prefix ← Dictionary
cW ← first Character of Prefix
Charstream ← Prefix.cW
Dictionary ← Prefix.C
pW ← cW
j ← n+1
end
----------------------------------------------------------------

The encoded string is shown in Table 4-16, and the encoding process is shown in Table 4-17. The following explanations are given:
The "Step" column indicates the encoding step;
The "Position" column indicates the current position in the input data;
The "Dictionary" column indicates the string added to the dictionary, and its index is in parentheses;
The "Output" column indicates the code word output.
Table 4-16 Encoded String
Position
1
1
2
3
4
5
6
7
8
9
Character
A
B
B
A
B
A
B
A
C

Table 4-17 LZW Encoding Process
Step
Position
Dictionary
Dictionary
Output
　　 (1)
A
　
　　 (2)
B
　
　　 (3)
C
　
1
1
(4)
A B
(1)
2
2
(5)
B B
(2)
3
3
3
(6)
B A
(2)
4
4
(7)
A B A
(4)
5
6
(8)
A B A C
(7)
6
--
--
--
(3)

Table 4-18 explains the decoding process. In each decoding step, the decoder reads a code word, outputs the corresponding string, and adds it to the dictionary. For example, in step 4, the previous code word (2) is stored in the previous code word (pW), the current code word (cW) is (4), the current string string.cW is output ("A B"), the previous string string.pW ("B") is the first character of the current string string.cW ("A"), and the result ("B A") is added to the dictionary, and its index number is (6).

Table 4-18 LZW Decoding Process
Step
Code
Dictionary
Output
　　 (1)
A
　
　　 (2)
B
　
　　 (3)
C
　
1
(1)
--
--
A
2
(2)
(2)
(4)
A B
B
3
(2)
(5)
B B
B
4
(4)
(6)
B A
A B
5
(7)
(7)
A B A
A B A
6
(3)
(8)
A B A C
C

The LZW algorithm is widely adopted, and its speed is faster than that of the LZ77 algorithm because it does not need to perform so many string comparison operations. Further improvements to the LZW algorithm include increasing the variable code word length and deleting old strings in the dictionary. The LZW algorithm after these improvements has been adopted in the GIF image format and the UNIX compression program. The LZW algorithm has obtained a patent, and the owner of the patent is Unisys, a large computer company in the United States. Except for commercial software production companies, the LZW algorithm can be used for free.