

layout: post
title: "Advanced AI Week2 notes"
date: 2020-01-11 20:11:01 +0800

categories: [Advanced AI]

Information Theory & Kolmogorov Complexity

Philosophical Issues

We need a general & formal & complete & consistent theory for induction & prediction.

We want a unique principle for prediction.
We have Occam's razor.
Problem: Occam's razor is not a formal/mathematical, objective principle.

Definitions & Notation

$i, k, n, t \in \mathbb{N} = \{ 1, 2, 3,... \} \\ \mathbb{B} = \{0, 1\} \\ x, y, z \in \mathbb{B}^* \\ \omega \in \mathbb{B}^\infty$
$\epsilon$ is the empty string
$1^n$ is the string of $n$ ones
$\ell (x)$ is the length of string $x$
$xy = x \circ y$ is the concatenation of strings $x$ and $y$

Every countable set is $\cong \mathbb{N}$.
Interpret strings as binary representations of natural numbers.
Problem: this interpretation is not unique (leading zeros). Instead use a bijection between $\mathbb{N}$ and $\mathbb{B}^*$, e.g. $x \leftrightarrow$ the number whose binary expansion is $1x$: $\epsilon \leftrightarrow 1,\ 0 \leftrightarrow 2,\ 1 \leftrightarrow 3,\ 00 \leftrightarrow 4, \dots$
First-order prefix coding $\bar{x} := 1^{\ell (x)}0x.$
Second-order prefix coding $x' := \overline{\ell (x)} x$.

$\ell (x) \leq \log (x + 1)$
$\ell (\bar{x}) \leq 2\log x$
$\ell (x') \leq \log x + 2\log \log x$
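
A minimal Python sketch of these two codes (helper names `bar` and `prime` are my own; the length of $x$ is written in plain binary rather than via the exact bijection above, which changes lengths only by $O(1)$ bits):

```python
def bar(x: str) -> str:
    """First-order prefix code: bar(x) = 1^len(x) 0 x."""
    return "1" * len(x) + "0" + x

def prime(x: str) -> str:
    """Second-order prefix code x' = bar(binary(len(x))) followed by x.
    Simplification: len(x) is coded in plain binary, not via the
    bijection N <-> B* of the notes; lengths agree up to O(1) bits."""
    return bar(format(len(x), "b")) + x

for x in ["0", "0101", "1" * 20]:
    n = len(x)
    assert len(bar(x)) == 2 * n + 1                          # l(bar(x)) = 2 l(x) + 1
    assert len(prime(x)) == n + 2 * len(format(n, "b")) + 1  # l(x') = l(x) + 2 l(l(x)) + 1
    print(x[:6], len(bar(x)), len(prime(x)))
```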

String $x$ is a prefix of $y$ $:\Leftrightarrow \exists z (\neq \epsilon)$ such that $xz = y$.
A set $\mathcal{P}$ is prefix-free (a prefix code) $:\Leftrightarrow$ no element is a prefix of another.

Theorem 2.1 (Kraft Inequality)
For a prefix code $\mathcal{P}$ we have $\sum_{x \in \mathcal{P}}2^{-\ell (x)} \leq 1$.

The set of all $\bar{x}$ is a prefix code.
The set of all $x'$ is a prefix code.
Pairing of strings: $\langle x, y \rangle := x'y$.
Functions of pairs: $f(x, y) := f(x'y)$.
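
A quick numerical check of prefix-freeness and the Kraft inequality for the first-order code (sketch only; `bar` is the same hypothetical encoder as above):

```python
from itertools import product

def bar(x: str) -> str:
    # First-order prefix code 1^len(x) 0 x.
    return "1" * len(x) + "0" + x

# Codewords bar(x) for all strings x of length <= 6, including epsilon.
strings = [""] + ["".join(b) for n in range(1, 7) for b in product("01", repeat=n)]
code = [bar(x) for x in strings]

# No codeword is a proper prefix of another.
assert not any(c != d and d.startswith(c) for c in code for d in code)

# Kraft: sum of 2^-l(codeword) stays below 1.
print(sum(2.0 ** -len(c) for c in code))  # = 1 - 2^-7, consistent with Theorem 2.1
```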

Turing Machines

Thesis 2.2 (Turing)
Everything that can be reasonably said to be computable by a human using a fixed procedure can also be computed by a Turing machine.

Thesis 2.3 (Church)
The class of algorithmically computable numerical functions (in the intuitive sense) coincides with the class of partial recursive functions.

Assumption 2.4 (Short compiler)
Given two natural Turing-equivalent formal systems $F_1$ and $F_2$, there always exists a single short program on $F_2$ which is capable of interpreting all $F_1$-programs.

Definition 2.5 (Prefix TM (pTM))

1. one unidirectional read-only input tape  
2. one unidirectional write-only output tape  
3. some bidirectional work tapes, initially filled with zeros  
4. all tapes are binary  
5. T halts on input p with output x  
6. $\{p : \exists x : T(p) = x\}$ forms a prefix code  
7. $p$ is a self-delimiting program (see the sketch below)  
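
Self-delimiting means a program can be read from the one-way input tape bit by bit, with the machine knowing exactly where it ends without any end marker. A small sketch under the same plain-binary length convention as above (`read_program` is a made-up helper):

```python
from typing import Iterator

def read_program(bits: Iterator[str]) -> str:
    """Read one self-delimiting program x' = bar(binary(len(x))) x from a
    unidirectional bit stream, consuming exactly its own length in bits."""
    k = 0
    while next(bits) == "1":   # unary part of bar(.): count 1s up to the 0
        k += 1
    n = int("".join(next(bits) for _ in range(k)), 2) if k else 0  # len(x)
    return "".join(next(bits) for _ in range(n))                   # x itself

stream = iter("1110" "101" "01101" "11")  # x' coding "01101", then 2 junk bits
print(read_program(stream))               # -> "01101"; the junk is never read
```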

Definition 2.6 (Monotone Turing Machine)
1.–4. as in Definition 2.5, plus:

5. T outputs/computes a string starting with x on input p  
6. T may continue operation and need not halt  
7. $\{p : T(p) = x*\}$ forms a prefix code  
8. $p$ is a minimal program  

Theorem 2.7 (Universal prefix/monotone Turing machine $U$)
There exists a universal prefix/monotone Turing machine $U$ which simulates (any) pTM/mTM $T_i$ with input $y'q$ if fed with input $y'i'q$, i.e.
$$U(y'i'q) = T_i(y'q) \quad \forall y,i,q$$
For $p \neq y'i'q$, $U(p)$ outputs nothing. $y$ is side information.

Theorem 2.8 (Halting Problem)
There is no TM $T$ with $T(i'p) = 1 \Leftrightarrow T_i(p)$ does not halt.

Kolmogorov Complexity

Theorem 2.9 (Universality/Minimality of $K_U$)
$K_U(x) \leq K_T(x) + c_{TU}$,
where $c_{TU} <^+ K_U(T) < \infty$ is independent of $x$.

Definition 2.10 ((conditional) prefix Kolmogorov complexity)
= length of the shortest program $p$ for which the reference machine $U$ outputs $x$ (given $y$):
$K(x) := \min_p \{\ell (p) : U(p) = x\}$,
$K(x|y) := \min_p \{\ell (p) : U(y'p) = x\}$

Theorem 2.11 (Upper Bound on $K$)
$K(x) <^+ \ell (x) + 2 \log \ell (x), \quad K(n) <^+ \log n + 2\log \log n$
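
The first bound is just the cost of a literal description: there is a prefix TM that on input $x'$ prints $x$, so by Theorem 2.9
$$K(x) <^+ \ell(x') \leq \ell(x) + 2\log \ell(x),$$
and the bound on $K(n)$ follows by applying this to the string coding $n$, which has length $\approx \log n$.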

Theorem 2.12 (Lower bound for most $n$, Kraft inequality)
$\sum_{x \in \mathbb{B}^*} 2^{-K(x)} \leq 1, \quad K(x) \geq \ell(x)$ for 'most' $x$
$K(n) \to \infty$ for $n \to \infty$

Theorem 2.13 (Extra Information)
$K(x | y) <^+ K(x) <^+ K(x, y)$

Theorem 2.14 (Subadditivity)
$K(xy) <^+ K(x, y) <^+ K(x) + K(y | x) <^+ K(x) + K(y)$

Theorem 2.15 (Symmetry of Information)
$K(x | y, K(y)) + K(y) =^+ K(x, y) =^+ K(y, x) =^+ K(y |x, K(x)) + K(x)$

Theorem 2.16 (Information Non-Increase)
$K(f(x)) <^+ K(x) + K(f)$ for recursive $f: \mathbb{B}^* \to \mathbb{B}^*$

Theorem 2.17 (Probability coding / MDL)
$K(x) <^+ -\log P(x) + K(P)$
if $P : \mathbb{B}^* \to [0, 1]$ is enumerable and $\sum_{x \in \mathbb{B}^*} P(x) \leq 1$
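
The $-\log P(x)$ term is the usual coding-theoretic one: code lengths $\lceil -\log_2 P(x) \rceil$ satisfy the Kraft inequality, so a prefix code with these lengths exists. A tiny numerical sketch with a made-up distribution `P`:

```python
import math

# Made-up distribution over a few strings, for illustration only.
P = {"00": 0.4, "01": 0.3, "10": 0.2, "11": 0.1}

# Shannon code lengths: within one bit of the ideal -log2 P(x).
lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}

# Kraft sum <= 1, so by (the converse of) Theorem 2.1 a prefix code
# with exactly these codeword lengths exists.
print(sum(2.0 ** -l for l in lengths.values()))        # 0.6875 here
for x, l in lengths.items():
    print(x, l, round(-math.log2(P[x]), 2))
```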

Definition 2.18 (Shannon entropy)
$Entropy(X) \equiv H(X) := -\sum_{x \in X} P(x) \log P(x)$
$Entropy(X|Y) \equiv H(X|Y) := -\sum_{y \in Y} P(y) \sum_{x \in X} P(x|y) \log P(x|y)$

Theorem 2.19 (Properties of Shannon entropy)
Upper bound: $H(X) \leq \log |X| = n$ for $X = \mathbb{B}^n$
Extra information: $H(X|Y) \leq H(X) \leq H(X,Y)$
Subadditivity: $H(X,Y) \leq H(X) + H(Y)$
Symmetry: $H(X|Y) + H(Y) = H(X,Y) = H(Y,X)$
Information non-increase: $H(f(X)) \leq H(X)$ for any $f$
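
A small numerical check of these properties on an arbitrary joint distribution (the 2×2 table below is made up; any joint distribution works):

```python
import math

# Made-up joint distribution P(X=x, Y=y) for X, Y in {0, 1}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(dist):
    # Shannon entropy of {outcome: probability}, in bits.
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}

H_xy, H_x, H_y = H(joint), H(px), H(py)
# H(X|Y) computed directly from the definition, not via the chain rule.
H_x_given_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)

print(H_x_given_y <= H_x <= H_xy)                # extra information
print(H_xy <= H_x + H_y)                         # subadditivity
print(abs(H_x_given_y + H_y - H_xy) < 1e-12)     # symmetry (chain rule)
```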

Theorem 2.20 (Monotone Kolmogorov Complexity Km)
$Km(x) := \min_p \{\ell (p) : U(p) = x*\}$
$Km(x) <^+ \ell (x)$
$Km(xy) \geq Km(x) \in \mathbb{N}_0$
$Km(x) <^+ -\log \mu (x) + K(\mu)$ if $\mu$ is a computable measure

Computability Concepts

Definition 2.21 (Computable functions)
We consider functions $f : \mathbb{N} \to \mathbb{R} :$
$f$ is finitely computable or recursive $iff$ there are Turing machines $T_{1/2}$ with output interpreted as natural numbers and $f(x) = \frac{T_1 (x)}{T_2 (x)}$,
$f$ is estimable $iff \exists$ recursive $\phi (\cdot , \cdot) \forall \varepsilon > 0 : |\phi (x, \lfloor \frac{1}{\varepsilon} \rfloor) - f(x)| < \varepsilon \forall x$.
$f$ is lower semicomputable or enumerable $iff$ $\phi (\cdot , \cdot)$ is recursive and $\lim_{t \to \infty} \phi (x, t) = f(x)$ and $\phi (x, t) \leq \phi (x, t + 1)$.
$f$ is approximable $iff \phi (\cdot , \cdot)$ is recursive and $\lim_{t \to \infty} \phi (x, t) = f(x)$
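
A concrete toy example of these notions with $f(x) = \sqrt{x}$ (function names are my own; $K$ itself is only co-enumerable, i.e. approximable from above, see Theorem 2.22 below):

```python
from math import isqrt

def phi_estimable(x: int, m: int) -> float:
    """Recursive phi with |phi(x, m) - sqrt(x)| < 1/m: a rational
    approximation with denominator m, witnessing that sqrt is estimable."""
    return isqrt(x * m * m) / m

def phi_enumerable(x: int, t: int) -> float:
    """Dyadic lower approximations: nondecreasing in t and converging to
    sqrt(x), witnessing lower semicomputability (enumerability)."""
    return isqrt(x * 4 ** t) / 2 ** t

print([phi_estimable(2, m) for m in (1, 10, 100, 1000)])   # -> sqrt(2) within 1/m
print([phi_enumerable(2, t) for t in range(6)])            # nondecreasing, -> sqrt(2)
```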

Theorem 2.22 ((Non)computability of K and Km Complexity)
The prefix complexity $K : \mathbb{B}^* \to \mathbb{N}$ and the monotone complexity $Km : \mathbb{B}^* \to \mathbb{N}$ are co-enumerable, but not finitely computable.
(Idea: both can be approximated from above by dovetailing over all programs; but if $K$ were computable, the first string $x$ with $K(x) > n$ could be computed from $n$ alone, giving $K(x) <^+ 2\log n < n$, a contradiction.)