仓库源文站点原文

前言

最近发现了一个非常有意思的算法:The Burrows-Wheeler Transform(块排序压缩算法),所以稍微花点时间简单记录一下

说是一个压缩算法,实际上这个算法主要做的事情是把数据重排序,实际上并没有减少数据的长度(通常反而因为增加了行尾标识而增长了)

但是它完成了一个非常有意思的效果:在几乎不增加字符串长度的情况下,把一个字符串的重复的部分尽可能聚合到一块了

而 Gzip 压缩算法基于Deflate算法,通过结合LZ77算法(查找并替换重复字符串)和霍夫曼编码,通过将接近的字符串聚合到一块,能够有效增加压缩率

算法操作方式

我做了一个简单的演示工具,我们就用经典的 banana 作为示例

<div id="v"> <div class="demo-block"> <div>在原串后添加结束结束符号<span class="demo-mono red">$</span>,且此符号认为是最小的字符</div> <div class="demo-window"> banana<span class="red">\$</span> </div> </div> <div class="demo-block"> <div>生成字符串的全部循环序列</div> <div class="demo-window"> banana<span class="red">\$</span><br> anana<span class="red">\$</span>b<br> nana<span class="red">\$</span>ba<br> ana<span class="red">\$</span>ban<br> na<span class="red">\$</span>bana<br> a<span class="red">\$</span>banan<br> <span class="red">\$</span>banana </div> </div> <div class="demo-block"> <div>将这几个字符串排序</div> <div class="demo-window"> <span class="red">\$</span>banana<br> a<span class="red">\$</span>banan<br> ana<span class="red">\$</span>ban<br> anana<span class="red">\$</span>b<br> banana<span class="red">\$</span><br> na<span class="red">\$</span>bana<br> nana<span class="red">\$</span>ba </div> </div> <div class="demo-block"> <div>取出最后一列字符串</div> <div class="demo-window"> <span class="red">\$</span>banan<span class="green">a</span><br> a<span class="red">\$</span>bana<span class="green">n</span><br> ana<span class="red">\$</span>ba<span class="green">n</span><br> anana<span class="red">\$</span><span class="green">b</span><br> banana<span class="green">\$</span><br> na<span class="red">\$</span>ban<span class="green">a</span><br> nana<span class="red">\$</span>b<span class="green">a</span> </div> </div> <div class="demo-block"> <div>得到结果</div> <div class="demo-window"> annb<span class="red">\$</span>aa </div> </div> <div class="demo-block"> <div>下面将进行还原操作</div> <div class="demo-window"> annb<span class="red">\$</span>aa </div> </div> <div class="demo-block"> <div>将结果排成一列</div> <div class="demo-window"> <span class="green">a</span><br> <span class="green">n</span><br> <span class="green">n</span><br> <span class="green">b</span><br> <span class="green">\$</span><br> <span class="green">a</span><br> <span class="green">a</span> </div> </div> <div class="demo-block"> <div>排序</div> <div class="demo-window"> <span class="green">\$</span><br> <span class="green">a</span><br> <span class="green">a</span><br> <span class="green">a</span><br> <span class="green">b</span><br> <span class="green">n</span><br> <span class="green">n</span> </div> </div> <div class="demo-block"> <div>在当前的列之前添加 BWT 结果</div> <div class="demo-window"> <span class="green">a</span><span class="gray">\$</span><br> <span class="green">n</span><span class="gray">a</span><br> <span class="green">n</span><span class="gray">a</span><br> <span class="green">b</span><span class="gray">a</span><br> <span class="green">\$</span><span class="gray">b</span><br> <span class="green">a</span><span class="gray">n</span><br> <span class="green">a</span><span class="gray">n</span> </div> </div> <div class="demo-block"> <div>再次排序</div> <div class="demo-window"> <span class="green">\$</span><span class="gray">b</span><br> <span class="green">a</span><span class="gray">\$</span><br> <span class="green">a</span><span class="gray">n</span><br> <span class="green">a</span><span class="gray">n</span><br> <span class="green">b</span><span class="gray">a</span><br> <span class="green">n</span><span class="gray">a</span><br> <span class="green">n</span><span class="gray">a</span> </div> </div> <div class="demo-block"> <div>重复上述步骤</div> <div class="demo-window"> <span class="green">a</span><span class="gray">\$b</span><br> <span class="green">n</span><span class="gray">a\$</span><br> <span class="green">n</span><span class="gray">an</span><br> <span class="green">b</span><span class="gray">an</span><br> <span class="green">\$</span><span class="gray">ba</span><br> <span class="green">a</span><span class="gray">na</span><br> <span class="green">a</span><span class="gray">na</span> </div> </div> <div class="demo-block"> <div>再次排序</div> <div class="demo-window"> <span class="green">\$</span><span class="gray">ba</span><br> <span class="green">a</span><span class="gray">\$b</span><br> <span class="green">a</span><span class="gray">na</span><br> <span class="green">a</span><span class="gray">na</span><br> <span class="green">b</span><span class="gray">an</span><br> <span class="green">n</span><span class="gray">a\$</span><br> <span class="green">n</span><span class="gray">an</span> </div> </div> <div class="demo-block"> <div>重复上述步骤</div> <div class="demo-window"> <span class="green">a</span><span class="gray">\$ba</span><br> <span class="green">n</span><span class="gray">a\$b</span><br> <span class="green">n</span><span class="gray">ana</span><br> <span class="green">b</span><span class="gray">ana</span><br> <span class="green">\$</span><span class="gray">ban</span><br> <span class="green">a</span><span class="gray">na\$</span><br> <span class="green">a</span><span class="gray">nan</span> </div> </div> <div class="demo-block"> <div>再次排序</div> <div class="demo-window"> <span class="green">\$</span><span class="gray">ban</span><br> <span class="green">a</span><span class="gray">\$ba</span><br> <span class="green">a</span><span class="gray">na\$</span><br> <span class="green">a</span><span class="gray">nan</span><br> <span class="green">b</span><span class="gray">ana</span><br> <span class="green">n</span><span class="gray">a\$b</span><br> <span class="green">n</span><span class="gray">ana</span> </div> </div> <div class="demo-block"> <div>重复上述步骤</div> <div class="demo-window"> <span class="green">a</span><span class="gray">\$ban</span><br> <span class="green">n</span><span class="gray">a\$ba</span><br> <span class="green">n</span><span class="gray">ana\$</span><br> <span class="green">b</span><span class="gray">anan</span><br> <span class="green">\$</span><span class="gray">bana</span><br> <span class="green">a</span><span class="gray">na\$b</span><br> <span class="green">a</span><span class="gray">nana</span> </div> </div> <div class="demo-block"> <div>再次排序</div> <div class="demo-window"> <span class="green">\$</span><span class="gray">bana</span><br> <span class="green">a</span><span class="gray">\$ban</span><br> <span class="green">a</span><span class="gray">na\$b</span><br> <span class="green">a</span><span class="gray">nana</span><br> <span class="green">b</span><span class="gray">anan</span><br> <span class="green">n</span><span class="gray">a\$ba</span><br> <span class="green">n</span><span class="gray">ana\$</span> </div> </div> <div class="demo-block"> <div>重复上述步骤</div> <div class="demo-window"> <span class="green">a</span><span class="gray">\$bana</span><br> <span class="green">n</span><span class="gray">a\$ban</span><br> <span class="green">n</span><span class="gray">ana\$b</span><br> <span class="green">b</span><span class="gray">anana</span><br> <span class="green">\$</span><span class="gray">banan</span><br> <span class="green">a</span><span class="gray">na\$ba</span><br> <span class="green">a</span><span class="gray">nana\$</span> </div> </div> <div class="demo-block"> <div>再次排序</div> <div class="demo-window"> <span class="green">\$</span><span class="gray">banan</span><br> <span class="green">a</span><span class="gray">\$bana</span><br> <span class="green">a</span><span class="gray">na\$ba</span><br> <span class="green">a</span><span class="gray">nana\$</span><br> <span class="green">b</span><span class="gray">anana</span><br> <span class="green">n</span><span class="gray">a\$ban</span><br> <span class="green">n</span><span class="gray">ana\$b</span> </div> </div> <div class="demo-block"> <div>重复上述步骤</div> <div class="demo-window"> <span class="green">a</span><span class="gray">\$banan</span><br> <span class="green">n</span><span class="gray">a\$bana</span><br> <span class="green">n</span><span class="gray">ana\$ba</span><br> <span class="green">b</span><span class="gray">anana\$</span><br> <span class="green">\$</span><span class="gray">banana</span><br> <span class="green">a</span><span class="gray">na\$ban</span><br> <span class="green">a</span><span class="gray">nana\$b</span> </div> </div> <div class="demo-block"> <div>再次排序</div> <div class="demo-window"> <span class="green">\$</span><span class="gray">banana</span><br> <span class="green">a</span><span class="gray">\$banan</span><br> <span class="green">a</span><span class="gray">na\$ban</span><br> <span class="green">a</span><span class="gray">nana\$b</span><br> <span class="green">b</span><span class="gray">anana\$</span><br> <span class="green">n</span><span class="gray">a\$bana</span><br> <span class="green">n</span><span class="gray">ana\$ba</span> </div> </div> <div class="demo-block"> <div>回到最初的矩阵</div> <div class="demo-window"> <span class="red">\$</span>banana<br> a<span class="red">\$</span>banan<br> ana<span class="red">\$</span>ban<br> anana<span class="red">\$</span>b<br> banana<span class="red">\$</span><br> na<span class="red">\$</span>bana<br> nana<span class="red">\$</span>ba </div> </div> </div><div class="demo-button-block"> <div></div> <button class="demo-button" id="prev" onclick="p()">上一步</button> <div></div> <button class="demo-button" id="next" onclick="n()">下一步</button> <div></div> </div><style> .demo-button { padding: 6px 14px; border: 1px solid #ccc; background: #fff; cursor: pointer; border-radius: 4px; } .demo-button:disabled { opacity: .4; cursor: not-allowed; } .demo-block { display: grid; place-items: center; } .demo-button-block { display: grid; grid-template-columns: 1fr auto 30px auto 1fr; } .demo-window { height: 200px; font-family: monospace; border: 1px solid gray; padding: 5px 30px 5px 30px; margin: 5px 0 5px 0; } .red { color: red; } .green { color: green; } .gray { color: gray; } </style><script> let i = 0, v = document.getElementById('v'), c = v.children, prev = document.getElementById('prev'), next = document.getElementById('next'); let u = () => { [...c].forEach((d, n) => d.hidden = n !== i); prev.disabled = i === 0; next.disabled = i === c.length - 1; }; let p = () => { if (i > 0) i--; u() }; let n = () => { if (i < c.length - 1) i++; u() }; u() </script>

算法原理

觉得这个算法有意思的地方,可能并不是它的实用价值。这里有一个非常有意思:为什么这样排序了几次之后,就会回到最初的矩阵

这里蕴藏了一个非常有意思的字符串排序逻辑。

通常情况下,我们会使用字符串从第一个字符开始比较,如果相同则比下一个字符

而在这个问题下,假定所有字符串长度相同,那其实完全可以从最后一位比起,然后逐次比较新增加的字符,可以实现类似桶排序的方式,达成最终排序结果

由于后排序的结果会覆盖先排序的结果,使得实际上达成了“从第一个字符开始比较,如果相同则比下一个字符”的效果