layout: post title: 下载整理中国哲学电子书的脚本 categories:
我喜欢将一些书的文本存在手机上, 这样可以利用碎片化的时间随时阅读几句. 中国的古文最适合这样的目的, 因为大多很短, 且言简意赅. 中国哲学书电子化计划网站上有很多整理好的古文文本, 但保存不方便, 所以就想能不能自动下载整理呢? 根据网站的地址和格式分析了一下, 似乎可行. 脚本如下:
<table class="highlighttable"><th colspan="2" style="text-align:left">ctxt.bsh</th><tr><td><div class="linenodiv" style="background-color: #f0f0f0; padding-right: 10px"><pre style="line-height: 125%"> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43</pre></div></td><td class="code"><div class="highlight"><pre style="line-height:125%"><span></span><span style="color: #008800; font-style: italic"># url="http://ctext.org/analects/zhs"; trs=1</span> <span style="color: #008800; font-style: italic"># url="http://ctext.org/mengzi/zhs"; trs=0</span> <span style="color: #008800; font-style: italic"># url="http://ctext.org/xunzi/zhs"; trs=0</span> <span style="color: #008800; font-style: italic"># url="http://ctext.org/kongzi-jiayu/zhs"</span> <span style="color: #008800; font-style: italic"># url="http://ctext.org/shi-shuo-xin-yu/zhs"</span> <span style="color: #008800; font-style: italic"># url="http://ctext.org/yan-shi-jia-xun/zhs"</span> <span style="color: #008800; font-style: italic"># url="http://ctext.org/zhuangzi/inner-chapters/zhs"</span> <span style="color: #008800; font-style: italic"># url="http://ctext.org/book-of-changes/zhs"</span> <span style="color: #B8860B">url</span><span style="color: #666666">=</span><span style="color: #BB4444">"http://ctext.org/shang-shu/zhs"</span>; <span style="color: #B8860B">trs</span><span style="color: #666666">=</span>0 <span style="color: #AA22FF">export</span> <span style="color: #B8860B">LANG</span><span style="color: #666666">=</span><span style="color: #AA22FF; font-weight: bold">$(</span>locale -uU<span style="color: #AA22FF; font-weight: bold">)</span> <span style="color: #008800; font-style: italic"># 设定中文支持</span> curl <span style="color: #B8860B">$url</span> > _chp awk <span style="color: #BB4444">'</span> BEGIN<span style="color: #666666">{</span>system<span style="color: #666666">(</span><span style="color: #BB4444">"rm -rf _ctx"</span><span style="color: #666666">)}</span> / <a <span style="color: #B8860B">class</span><span style="color: #666666">=</span><span style="color: #BB4444">"menuitem"</span>/ <span style="color: #666666">{</span> sub<span style="color: #666666">(</span>/.*href<span style="color: #666666">=</span><span style="color: #BB6622; font-weight: bold">\"</span>/, <span style="color: #BB4444">""</span><span style="color: #666666">)</span> sub<span style="color: #666666">(</span>/<span style="color: #BB6622; font-weight: bold">\"</span>.*/, <span style="color: #BB4444">""</span><span style="color: #666666">)</span> <span style="color: #B8860B">url</span><span style="color: #666666">=</span><span style="color: #BB4444">"http://ctext.org/"</span><span style="color: #B8860B">$0</span> print url system<span style="color: #666666">(</span><span style="color: #BB4444">"curl "</span>url<span style="color: #BB4444">" >>_ctx"</span><span style="color: #666666">)</span> <span style="color: #666666">}</span> <span style="color: #BB4444">'</span> _chp awk -v <span style="color: #B8860B">trs</span><span style="color: #666666">=</span><span style="color: #B8860B">$trs</span> <span style="color: #BB4444">'</span> BEGIN <span style="color: #666666">{</span><span style="color: #B8860B">chp</span><span style="color: #666666">=</span>0; <span style="color: #B8860B">tot</span><span style="color: #666666">=</span>0<span style="color: #666666">}</span> /<div <span style="color: #B8860B">id</span><span style="color: #666666">=</span><span style="color: #BB4444">"content3"</span>/ <span style="color: #666666">{</span> gsub<span style="color: #666666">(</span>/^<span style="color: #666666">[</span>^《<span style="color: #666666">]</span>+《/,<span style="color: #BB4444">"《"</span><span style="color: #666666">)</span> gsub<span style="color: #666666">(</span>/》.+/,<span style="color: #BB4444">"》"</span><span style="color: #666666">)</span> chp++; <span style="color: #B8860B">sec</span><span style="color: #666666">=</span>0 print <span style="color: #BB4444">"第 "</span>chp<span style="color: #BB4444">" 章 "</span><span style="color: #B8860B">$0</span><span style="color: #BB4444">"\n"</span> <span style="color: #666666">}</span> /<div <span style="color: #B8860B">id</span><span style="color: #666666">=</span><span style="color: #BB4444">"comm[0-9]+"</span>/ <span style="color: #666666">{</span> gsub<span style="color: #666666">(</span>/<<span style="color: #666666">[</span>^><span style="color: #666666">]</span>+>/,<span style="color: #BB4444">""</span><span style="color: #666666">)</span> sec++; <span style="color: #AA22FF; font-weight: bold">if</span><span style="color: #666666">(</span>trs<span style="color: #666666">)</span> <span style="color: #666666">{</span> <span style="color: #AA22FF; font-weight: bold">if</span><span style="color: #666666">(</span>sec%2<span style="color: #666666">)</span> print chp<span style="color: #BB4444">"."</span><span style="color: #666666">(</span>sec+1<span style="color: #666666">)</span>/2, <span style="color: #B8860B">$0</span> <span style="color: #AA22FF; font-weight: bold">else</span> <span style="color: #666666">{</span>tot++; print tot<span style="color: #BB4444">" "</span><span style="color: #B8860B">$0</span><span style="color: #BB4444">"\n"</span><span style="color: #666666">}</span> <span style="color: #666666">}</span> <span style="color: #AA22FF; font-weight: bold">else</span> <span style="color: #666666">{</span> tot++; print chp<span style="color: #BB4444">"."</span>sec<span style="color: #BB4444">"/"</span>tot, <span style="color: #B8860B">$0</span><span style="color: #BB4444">"\n"</span> <span style="color: #666666">}</span> <span style="color: #666666">}</span> <span style="color: #BB4444">'</span> _ctx > _ctx.txt </pre></div> </td></tr></table>可惜的是, 这个网站明言请注意:严禁使用自动下载软体下载本网站的大量网页,违者自动封锁,不另行通知。
根据我的测试, 还确实是这样, 下载量大了之后自动封IP. 要想解决的话, 只能自动换IP或者慢慢下载了.