4 Mutual information
5 Discrete memoryless channels
6 Entropy and mutual information for continuous RRVV
7 AWGN channel capacity theorem
8 Bridging towards channel coding
9 Conclusions
10 References
Information Theory was developed in the 2nd half of the 20th century. It relies on solid mathematical foundations [1, 2, 3, 4].
It tries to address two basic questions:
◮ To what extent can we compress data for a more efficient usage of the limited communication resources? → Entropy
◮ What is the largest possible data transfer rate for given resources and conditions? → Channel capacity
Key concepts for Information Theory are thus entropy ($H(X)$) and mutual information ($I(X;Y)$).
◮ $X$, $Y$ are random variables (RRVV) of some kind.
◮ Information will naturally be characterized by means of RRVV.
It was long assumed in telecommunications that the error rate increased with increasing data rate.
◮ Claude Shannon demonstrated that error-free transmission may be possible under certain conditions.
Information Theory provides strict bounds for any communication system.
◮ Maximum compression → $H(X)$; or minimum $I(X;\hat{X})$, with controlled distortion.
◮ Maximum transfer rate → maximum $I(X;Y)$.
◮ Any given communication system works between said limits.
The mathematics behind them is not always constructive, but it provides guidelines to design algorithms to improve communications given a set of available resources.
◮ The resources in this context are parameters such as available transmission power, available bandwidth, signal-to-noise ratio and the like.
(Diagram: compression $X \to \hat{X}$; transmission over a channel $X \to Y$.)
A discrete memoryless source emits a symbol from a given set at a given symbol rate, chosen randomly and independently from the previous and the subsequent ones.
$$\zeta = \{s_0, \cdots, s_{K-1}\}, \quad P(S = s_k) = p_k, \quad k = 0, 1, \cdots, K-1;$$
$K$ is the source radix.
Information quantity is a random variable defined as
$$I(s_k) = \log_2 \frac{1}{p_k}$$
with properties
◮ $I(s_k) = 0$ if $p_k = 1$
◮ $I(s_k) > I(s_i)$ if $p_k < p_i$
◮ $I(s_k) \ge 0$, for $0 \le p_k \le 1$
◮ $I(s_l, s_k) = I(s_l) + I(s_k)$ (for independent symbols)
The definition is conventional but also convenient: it relates uncertainty and information.
The entropy of the source measures its average “information content”, and it is defined as
$$H(\zeta) = E_{\{p_k\}}[I(s_k)] = \sum_{k=0}^{K-1} p_k \cdot I(s_k) = \sum_{k=0}^{K-1} p_k \cdot \log_2 \frac{1}{p_k}, \qquad 0 \le H(\zeta) \le \log_2(K)$$
◮ $p_j = 1 \wedge p_k = 0,\ k \ne j \;\Rightarrow\; H(\zeta) = 0$
◮ $p_k = \frac{1}{K},\ k = 0, \cdots, K-1 \;\Rightarrow\; H(\zeta) = \log_2(K)$
There is no potential information (uncertainty) in deterministic (“degenerate”) sources.
Entropy is conventionally measured in “bits”.
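As a quick numerical illustration (not part of the original slides), the sketch below computes $H(\zeta)$ for a small discrete source; the probability vectors are made up for the example.

```python
import math

def entropy(probs):
    """Entropy H (in bits) of a discrete source with probability vector probs."""
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    # Terms with p = 0 contribute nothing (lim p->0 of p*log2(1/p) = 0).
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical 4-symbol source: entropy stays below log2(4) = 2 bits
# because the symbols are not equiprobable.
p = [0.5, 0.25, 0.125, 0.125]
print(entropy(p))            # 1.75 bits
print(entropy([0.25] * 4))   # 2.0 bits (uniform source reaches log2 K)
```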
Source coding addresses the issues related to handling the output data from a given source, from the point of view of Information Theory.
One of the main issues is data compression, its theoretical limits and the related practical algorithms.
◮ The most important prerequisite in communications at the PHY is to keep data integrity → any transformation has to be fully invertible (lossless compression).
◮ Lossy compression is a key field at other communication levels: for example, compressing a video stream to the MPEG-2 standard to create a transport frame.
◮ In other related fields, like cryptography, some non-invertible algorithms for data compression are of utmost interest (e.g. hash functions, to create almost-unique tokens).
Source coding represents the data from the source by assigning to each symbol $s_k$ a corresponding codeword (binary, in our case).
The aim of this source coding process is to represent the source symbols more efficiently.
◮ We focus on variable-length, binary, invertible codes → the codewords are unique blocks of binary symbols of length $l_k$, one for each symbol $s_k$.
◮ The correspondence codeword ↔ original symbol constitutes a code.
Average codeword length:
$$\overline{L} = E[l_k] = \sum_{k=0}^{K-1} p_k \cdot l_k$$
Coding efficiency:
$$\eta = \frac{L_{min}}{\overline{L}} \le 1$$
The source coding theorem sets the limits for lossless data compression [5].
◮ $N$ iid¹ random variables with entropy $H(\zeta)$ each can be compressed into $N \cdot H(\zeta)$ bits with negligible risk of information loss as $N \to \infty$. If they are compressed into fewer than $N \cdot H(\zeta)$ bits, it is certain that some information will be lost.
In practical terms, for a single random variable, this means
$$L_{min} \ge H(\zeta)$$
and, therefore, the coding efficiency can be defined as
$$\eta = \frac{H(\zeta)}{\overline{L}}$$
Like other results in Information Theory, this theorem provides the limits, but does not tell us how to actually reach them.
¹ Independent and identically distributed.
Huffman coding provides a practical algorithm to perform source coding within the limits shown. It is an instance of a class of codes called prefix codes.
◮ No binary word within the codeset is the prefix of any other one.
Properties:
◮ Unique coding.
◮ Instantaneous decoding.
◮ The lengths $l_k$ meet the Kraft-McMillan inequality [2]: $\sum_{k=0}^{K-1} 2^{-l_k} \le 1$.
◮ $\overline{L}$ of the code is bounded by
$$H(\zeta) \le \overline{L} < H(\zeta) + 1, \qquad H(\zeta) = \overline{L} \leftrightarrow p_k = 2^{-l_k}$$
Meeting the Kraft-McMillan inequality guarantees that a prefix code with the given $l_k$ can be constructed.
1 List the symbols $s_k$ in a column, in order of decreasing probabilities.
2 Compose the probabilities of the last 2 symbols: the probabilities are added into a new dummy compound value/symbol.
3 Reorder the probability set in an adjacent column, putting the new dummy symbol probability as high as possible, and retiring the values involved.
4 In the process of moving probabilities to the right, keep the values of the unaffected symbols (making room for the compound value if needed), but assign a 0 to one of the symbols affected and a 1 to the other (top or bottom, but always keep the same criterion along the process).
5 Start the process afresh in the new column.
6 When only two probabilities remain, assign the last 0 and 1, and stop.
7 Write the binary codeword corresponding to each original symbol by tracing back the trajectory of each of the original symbols and the dummy symbols they take part in, from the last towards the initial column.
In Huffman coding, assigning the final codeword patterns proceeds from right to left.
To characterize the resulting code, it is important to calculate:
$$H(\zeta); \qquad \overline{L} = \sum_{k=0}^{K-1} p_k \cdot l_k; \qquad \eta = \frac{H(\zeta)}{\overline{L}}; \qquad \sigma^2 = \sum_{k=0}^{K-1} p_k \cdot \left(l_k - \overline{L}\right)^2$$
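The column procedure above can be mimicked with a priority queue. The following Python sketch (not from the slides) builds a Huffman code for an illustrative probability set and computes $H(\zeta)$, $\overline{L}$, $\eta$ and $\sigma^2$ as defined above; tie-breaking may differ from the hand method, but the resulting average length is still optimal.

```python
import heapq
import math

def huffman_code(probs):
    """Build a binary Huffman code for symbols 0..K-1 with probabilities probs.
    Returns a dict symbol -> codeword (string of '0'/'1')."""
    # Each heap entry: (probability, tie-breaker, list of (symbol, partial codeword)).
    heap = [(p, k, [(k, "")]) for k, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # two least probable entries
        p1, _, group1 = heapq.heappop(heap)
        # Prepend a bit: 0 for one branch, 1 for the other (same criterion throughout).
        merged = [(s, "0" + c) for s, c in group0] + [(s, "1" + c) for s, c in group1]
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return dict(heap[0][2])

probs = [0.4, 0.2, 0.2, 0.1, 0.1]          # hypothetical source
code = huffman_code(probs)
H = sum(p * math.log2(1 / p) for p in probs)
L = sum(p * len(code[k]) for k, p in enumerate(probs))
var = sum(p * (len(code[k]) - L) ** 2 for k, p in enumerate(probs))
print(code)
print("H =", round(H, 3), " L =", L, " eta =", round(H / L, 3), " var =", round(var, 3))
```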
From the point of view of data communication (sequences of data symbols), it is useful to define the n-th extension of the source $\zeta$:
◮ A symbol of the n-th extension of the source is built by taking $n$ successive symbols from it (symbols in $\zeta^n$).
◮ If the source is iid, for $s_i \in \zeta$, the new extended source $\zeta^n$ is characterized by the probabilities:
$$P(\sigma = (s_1 \cdots s_n) \in \zeta^n) = \prod_{i=1}^{n} P(s_i \in \zeta)$$
◮ Given that the sequence $s_i$ in $\sigma \in \zeta^n$ is independent and identically distributed (iid) for any $n$, the entropy of this new source is:
$$H(\zeta^n) = n \cdot H(\zeta)$$
By applying Huffman coding to $\zeta^n$, we derive a code with average length $\overline{L}$ characterized by:
$$H(\zeta^n) \le \overline{L} < H(\zeta^n) + 1$$
The average codeword length per original symbol is then
$$\overline{L}_s = \frac{\overline{L}}{n} \;\rightarrow\; H(\zeta) \le \overline{L}_s < H(\zeta) + \frac{1}{n}$$
The final efficiency turns out to be
$$\eta = \frac{H(\zeta^n)}{\overline{L}} = \frac{n \cdot H(\zeta)}{\overline{L}} = \frac{H(\zeta)}{\overline{L}_s}$$
Efficiency thus tends to 1 as $n \to \infty$, as shown by the previous inequality.
The method of the n-th extension of the source, nevertheless, is rendered impractical for large $n$, due to:
◮ Increasing delay.
◮ Increasing coding and decoding complexity.
We now extend the definition of entropy to 2 RRVV.
◮ 2 or more RRVV are needed when analyzing communication channels from the point of view of Information Theory.
◮ These RRVV can also be seen as a random vector.
◮ The underlying concept is the characterization of channel input vs channel output, and what we can learn about the former by observing the latter.
Joint entropy of 2 RRVV:
$$H(X,Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \cdot \log_2 \frac{1}{p(x,y)} = E_{p(x,y)}\left[\log_2 \frac{1}{P(X,Y)}\right]$$
Properties:
$$H(X,Y) \ge \max[H(X), H(Y)]; \qquad H(X,Y) \le H(X) + H(Y)$$
Conditional entropy:
$$H(Y|X) = \sum_{x \in \mathcal{X}} p(x) \cdot H(Y|X=x) = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \cdot \log_2 \frac{1}{p(y|x)} =$$
$$= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \cdot \log_2 \frac{1}{p(y|x)} = E_{p(x,y)}\left[\log_2 \frac{1}{P(Y|X)}\right]$$
$H(Y|X)$ is a measure of the uncertainty in $Y$ once $X$ is known.
Properties:
$$H(Y|Y) = 0; \qquad H(Y|X) = H(Y) \text{ if } X \text{ and } Y \text{ are independent}$$
Chain rule for the joint entropy of 2 RRVV:
$$H(X,Y) = H(X) + H(Y|X).$$
The expression points towards the following common-wisdom result: “joint knowledge about X and Y is the knowledge about X plus the information in Y not related to X”.
Proof:
$$p(x,y) = p(x) \cdot p(y|x);$$
$$\log_2(p(x,y)) = \log_2(p(x)) + \log_2(p(y|x));$$
$$E_{p(x,y)}[\log_2(P(X,Y))] = E_{p(x,y)}[\log_2(P(X))] + E_{p(x,y)}[\log_2(P(Y|X))].$$
Corollary: $H(X,Y|Z) = H(X|Z) + H(Y|X,Z)$.
As another consequence: $H(Y|X) \le H(Y)$.
The entropy $H(X)$ is the information needed to describe the RV $X$ on average.
The relative entropy $D(p \| q)$ measures the increase in information needed to describe $X$ when it is characterized by means of a distribution $q(x)$ instead of $p(x)$.
$$X;\ p(x) \rightarrow H(X) \qquad\qquad X;\ q(x) \rightarrow H(X) + D(p \| q)$$
Definition of relative entropy (aka Kullback-Leibler divergence, or, improperly, “distance”):
$$D(p \| q) = \sum_{x \in \mathcal{X}} p(x) \cdot \log_2 \frac{p(x)}{q(x)} = E_{p(x)}\left[\log_2 \frac{P(X)}{Q(X)}\right]$$
Note that: $\lim_{x \to 0}(x \cdot \log(x)) = 0$; $\quad 0 \cdot \log\frac{0}{0} = 0$; $\quad p \cdot \log\frac{p}{0} = \infty$.
Properties of relative entropy:
1 $D(p \| q) \ge 0$.
2 $D(p \| q) = 0 \leftrightarrow p(x) = q(x)$.
3 It is not symmetric. Therefore, it is not a true distance from the metric point of view.
Mutual information of 2 RRVV:
$$I(X;Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \cdot \log_2 \frac{p(x,y)}{p(x) \cdot p(y)} =$$
$$= D\left(p(x,y) \,\|\, p(x) \cdot p(y)\right) = E_{p(x,y)}\left[\log_2 \frac{P(X,Y)}{P(X) \cdot P(Y)}\right].$$
The mutual information between X and Y is the information in X, minus the information in X not related to Y.
Properties of mutual information:
1 Symmetry: $I(X;Y) = I(Y;X)$.
2 Non-negative: $I(X;Y) \ge 0$.
3 $I(X;Y) = 0 \leftrightarrow X$ and $Y$ are independent.
Figure 3: Relationship between entropy and mutual information ($H(X)$, $H(Y)$, $H(X|Y)$, $H(Y|X)$, $I(X;Y)$, $H(X,Y)$).
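To make these relationships concrete, here is a small Python sketch (not from the slides) that computes $H(X)$, $H(Y)$, $H(X,Y)$, $H(X|Y)$ and $I(X;Y)$ for a made-up joint distribution, using $I(X;Y) = H(X) + H(Y) - H(X,Y)$, which follows from the chain rule.

```python
import math

def H(probs):
    """Entropy in bits of a probability vector (zero entries ignored)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Hypothetical joint pmf p(x, y) over binary X and binary Y (rows: x, columns: y).
pxy = [[0.4, 0.1],
       [0.1, 0.4]]

px = [sum(row) for row in pxy]                  # marginal of X
py = [sum(col) for col in zip(*pxy)]            # marginal of Y
HX, HY = H(px), H(py)
HXY = H([p for row in pxy for p in row])        # joint entropy H(X,Y)
I = HX + HY - HXY                               # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(f"H(X)={HX:.3f}  H(Y)={HY:.3f}  H(X,Y)={HXY:.3f}  I(X;Y)={I:.3f}")
print(f"H(X|Y)={HXY - HY:.3f}  check: H(X) - I = {HX - I:.3f}")
```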
Now that we have defined mutual information, we can examine lossy compression under the so-called Rate-Distortion Theory.
A source $X$ is compressed into a new source $\hat{X}$, a process characterized by the probabilities
$$p(x); \qquad p(\hat{x}, x) = p(\hat{x}|x)\, p(x)$$
The quality of the compression is measured according to a distortion parameter $d(x, \hat{x})$ and its expected value $D$:
$$D = E_{p(\hat{x},x)}[d(x, \hat{x})] = \sum_{x} \sum_{\hat{x}} d(x, \hat{x})\, p(\hat{x}|x)\, p(x)$$
$d(x, \hat{x})$ is any convenient distance measurement, e.g.:
◮ The Euclidean distance, $d(x, \hat{x}) = |x - \hat{x}|^2$.
◮ The Hamming distance, $d(x, \hat{x}) = 1$ if $x \ne \hat{x}$, and 0 otherwise.
In lossy compression, we want $\hat{X}$ to be represented by the lowest number of bits, and this means finding a distribution $p(\hat{x}|x)$ such that
$$R = \min_{p(\hat{x}|x)} I(X; \hat{X}) = \min_{p(\hat{x}|x)} \left[ H(X) - H(X|\hat{X}) \right]$$
$R$ is the rate in bits of the compressed source. The source is compressed to an entropy lower than $H(X)$: since the process is lossy, $H(X|\hat{X}) \ne 0$.
The minimum is trivially 0 when $X$ and $\hat{X}$ are fully unrelated, so the minimization is performed subject to $D \le D^*$, for a maximum admissible distortion $D^*$.
Conversely, the problem can be addressed as the minimization of the distortion $D$ subject to $I(X; \hat{X}) \le R^*$, for a maximum admissible rate $R^* < H(X)$.
A channel is a system where the output is a probabilistic function of the input.
A DMC (Discrete Memoryless Channel) consists of:
◮ Input alphabet $\mathcal{X} = \{x_0, x_1, \cdots, x_{J-1}\}$, corresponding to RV $X$.
◮ Output alphabet $\mathcal{Y} = \{y_0, y_1, \cdots, y_{K-1}\}$, corresponding to RV $Y$, a noisy version of RV $X$.
◮ A set of transition probabilities linking input and output, following
$$\{p(y_k|x_j)\}_{k=0,1,\cdots,K-1;\ j=0,1,\cdots,J-1}, \qquad p(y_k|x_j) = P(Y = y_k \,|\, X = x_j)$$
◮ $\mathcal{X}$ and $\mathcal{Y}$ are finite and discrete.
◮ The channel is memoryless, since the output symbol depends only on the current input symbol.
The channel matrix is a $J \times K$ matrix
$$\mathbf{P} = \begin{pmatrix} p(y_0|x_0) & p(y_1|x_0) & \cdots & p(y_{K-1}|x_0) \\ p(y_0|x_1) & p(y_1|x_1) & \cdots & p(y_{K-1}|x_1) \\ \vdots & \vdots & \ddots & \vdots \\ p(y_0|x_{J-1}) & p(y_1|x_{J-1}) & \cdots & p(y_{K-1}|x_{J-1}) \end{pmatrix}$$
The channel is said to be symmetric when each column is a permutation of any other, and each row is a permutation of any other.
Important property:
$$\sum_{k=0}^{K-1} p(y_k|x_j) = 1, \quad \forall\, j = 0, 1, \cdots, J-1.$$
The output distribution is determined by the input (a priori) probabilities and the channel matrix, following
$$\mathbf{p}_X = (p(x_0), p(x_1), \cdots, p(x_{J-1})), \quad p(x_j) = P(X = x_j)$$
$$\mathbf{p}_Y = (p(y_0), p(y_1), \cdots, p(y_{K-1})), \quad p(y_k) = P(Y = y_k)$$
$$p(y_k) = \sum_{j=0}^{J-1} p(y_k|x_j) \cdot p(x_j), \quad \forall\, k = 0, 1, \cdots, K-1 \qquad\Rightarrow\qquad \mathbf{p}_Y = \mathbf{p}_X \cdot \mathbf{P}$$
When $J = K$ and $y_j$ is the correct choice when sending $x_j$, we can calculate the average symbol error probability as
$$P_e = \sum_{j=0}^{J-1} p(x_j) \cdot \sum_{k=0,\, k \ne j}^{K-1} p(y_k|x_j)$$
The probability of correct transmission is $1 - P_e$.
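A minimal numerical sketch (not part of the slides), assuming a binary symmetric channel with made-up crossover probability and input distribution, to illustrate $\mathbf{p}_Y = \mathbf{p}_X \cdot \mathbf{P}$ and the computation of $P_e$:

```python
import numpy as np

# Hypothetical binary symmetric channel with crossover probability 0.1:
# rows are inputs x_j, columns are outputs y_k.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
pX = np.array([0.7, 0.3])          # arbitrary a priori input distribution

pY = pX @ P                        # output distribution p_Y = p_X . P
# Average symbol error probability: sum over j of p(x_j) * sum_{k != j} p(y_k|x_j).
Pe = sum(pX[j] * (1.0 - P[j, j]) for j in range(len(pX)))
print("pY =", pY, " Pe =", Pe, " correct =", 1 - Pe)
```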
A practical example of this kind of channel: modulation with hard decision.
Figure 4: 16-QAM transmitted constellation (symbols $x_j$).
Figure 5: 16-QAM received constellation with noise (transition probabilities $p(y_k|x_j)$).
Another important example → the binary erasure channel, typical of storage systems.
◮ Reading data in storage systems can also be modeled as sending information through a channel with given probabilistic properties.
◮ An erasure marks the complete uncertainty over the symbol read.
◮ There are methods to recover from erasures, based on the principles of Information Theory.
Figure 6: Diagram showing the binary erasure channel ($x_0 = 0$, $x_1 = 1$; $y_0 = 0$, $y_1 = \epsilon$, $y_2 = 1$; transition probabilities $1-p$ and $p$).
$$\mathbf{P} = \begin{pmatrix} 1-p & p & 0 \\ 0 & p & 1-p \end{pmatrix}$$
$I(X;Y)$ depends on both $\mathbf{P}$ and $\mathbf{p}_X$. Characterizing the possibilities of the channel requires removing the dependency on $\mathbf{p}_X$.
Channel capacity is defined as
$$C = \max_{\mathbf{p}_X} [I(X;Y)]$$
◮ It is the maximum average mutual information enabled by the channel, in bits per channel use, and the best we can get out of it in terms of reliable information transfer.
◮ It only depends on the channel transition probabilities $\mathbf{P}$.
◮ If the channel is symmetric, the distribution $\mathbf{p}_X$ that maximizes $I(X;Y)$ is the uniform one (equiprobable symbols).
Channel coding is a process where controlled redundancy is added to protect data integrity.
◮ A channel-encoded information sequence is represented by a block of $n$ bits, encoded from an information sequence represented by $k \le n$ bits.
◮ The so-called code rate is calculated as $R = k/n \le 1$.
Consider a source $\zeta$, emitting symbols with period $T_s$.
◮ The binary information rate of such a source is $\frac{H(\zeta)}{T_s}$ (bits/s).
Consider a discrete memoryless channel, used to send coded data every $T_c$ seconds.
◮ The maximum possible data transfer rate would be $\frac{C}{T_c}$ (bits/s).
The noisy-channel coding theorem states the following [5]:
◮ If $\frac{H(\zeta)}{T_s} \le \frac{C}{T_c}$, there exists a coding scheme that guarantees error-free transmission (i.e. $P_e$ arbitrarily small).
◮ Conversely, if $\frac{H(\zeta)}{T_s} > \frac{C}{T_c}$, the communication cannot be made reliable (i.e. we cannot have a bounded $P_e$, as small as desired).
Please note that, again, the theorem is asymptotic and not constructive: it does not say how to actually reach the limit.
Example over a binary channel. Consider a binary source $\zeta = \{0, 1\}$, with equiprobable symbols.
◮ $H(\zeta) = 1$ info bits/“channel use”.
◮ The source works at a rate of $\frac{1}{T_s}$ “channel uses”/s, and $\frac{H(\zeta)}{T_s}$ info bits/s.
Consider an encoder with rate $\frac{k}{n}$ info/coded bits, and $\frac{1}{T_c}$ “channel uses”/s.
◮ Note that $\frac{k}{n} = \frac{T_c}{T_s}$.
The maximum achievable rate is $\frac{C}{T_c}$ coded bits/s.
If $\frac{H(\zeta)}{T_s} = \frac{1}{T_s} \le \frac{C}{T_c}$, we can find a coding scheme so that $P_e$ is made arbitrarily small (as small as desired).
◮ This means that an appropriate coding scheme has to meet $\frac{k}{n} \le C$ in order to exploit the possibilities of the noisy-channel coding theorem.
It can be shown that, if a bit error probability of $P_b$ is acceptable, coding rates up to
$$R(P_b) = \frac{C}{1 - H(P_b)}$$
are achievable, where $H(P_b)$ is the binary entropy function. Rates greater than that cannot be achieved with the given bit error probability.
In a binary symmetric channel without noise (error probability $p = 0$), it can be demonstrated that
$$C = \max_{\mathbf{p}_X}[I(X;Y)] = 1 \text{ bit/“channel use”}.$$
Figure 7: Diagram showing the error-free binary symmetric channel ($x_0 = 0 \to y_0 = 0$, $x_1 = 1 \to y_1 = 1$, each with probability 1).
$$\mathbf{P} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$
In the binary symmetric channel with error probability $p \ne 0$,
$$C = 1 - H(p) = 1 - \left[ p \cdot \log_2 \frac{1}{p} + (1-p) \cdot \log_2 \frac{1}{1-p} \right] \text{ bits/“channel use”}.$$
Figure 8: Diagram showing the binary symmetric channel BSC($p$) ($x_0 = 0$, $x_1 = 1$; $y_0 = 0$, $y_1 = 1$; transition probabilities $1-p$ and $p$).
$$\mathbf{P} = \begin{pmatrix} 1-p & p \\ p & 1-p \end{pmatrix}$$
In the binary erasure channel with erasure probability $p$,
$$C = \max_{\mathbf{p}_X}[I(X;Y)] = 1 - p \text{ bits/“channel use”}.$$
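The BSC capacity can be checked numerically by maximizing $I(X;Y)$ over the input distribution, as in the following sketch (not from the slides); the crossover probability is arbitrary.

```python
import numpy as np

def mutual_information(pX, P):
    """I(X;Y) in bits for input distribution pX and channel matrix P (rows p(y|x))."""
    pXY = pX[:, None] * P                      # joint p(x, y)
    pY = pXY.sum(axis=0)                       # marginal p(y)
    mask = pXY > 0
    return float((pXY[mask] * np.log2(pXY[mask] / np.outer(pX, pY)[mask])).sum())

p = 0.1                                        # crossover probability
P_bsc = np.array([[1 - p, p], [p, 1 - p]])

# Brute-force search over input distributions (q, 1-q): the maximum is the capacity.
qs = np.linspace(0.001, 0.999, 999)
caps = [mutual_information(np.array([q, 1 - q]), P_bsc) for q in qs]
Hb = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
print("numerical C ≈", max(caps))              # maximum reached at q = 0.5
print("closed form 1 - H(p) =", 1 - Hb)
```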
The differential entropy or continuous entropy of a continuous RV $X$ with pdf $f_X(x)$ is defined as
$$h(X) = \int_{-\infty}^{\infty} f_X(x) \cdot \log_2 \frac{1}{f_X(x)} \, dx$$
It does not measure an absolute quantity of information, hence the term differential (it can take values lower than 0).
The differential entropy of a continuous random vector $\vec{X}$ with joint pdf $f_{\vec{X}}(\vec{x})$ is defined as
$$h(\vec{X}) = \int_{\vec{x}} f_{\vec{X}}(\vec{x}) \cdot \log_2 \frac{1}{f_{\vec{X}}(\vec{x})} \, d\vec{x}$$
$$\vec{X} = (X_0, \cdots, X_{N-1}); \qquad f_{\vec{X}}(\vec{x}) = f_{\vec{X}}(x_0, \cdots, x_{N-1})$$
For a given variance value $\sigma^2$, the Gaussian RV exhibits the largest achievable differential entropy.
◮ This means the Gaussian RV has a special place in the domain of continuous RRVV within Information Theory.
Properties of differential entropy:
◮ Differential entropy is invariant under translations: $h(X + c) = h(X)$.
◮ Scaling: $h(a \cdot X) = h(X) + \log_2(|a|)$; $\quad h(\mathbf{A} \cdot \vec{X}) = h(\vec{X}) + \log_2(|\det \mathbf{A}|)$.
◮ For a given variance $\sigma^2$, a Gaussian RV $X$ with variance $\sigma^2_X = \sigma^2$ and any other RV $Y$ with variance $\sigma^2_Y = \sigma^2$ satisfy $h(X) \ge h(Y)$.
◮ For a Gaussian RV $X$ with variance $\sigma^2_X$:
$$h(X) = \frac{1}{2} \log_2\left(2 \pi e \, \sigma^2_X\right)$$
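A small check (not from the slides) of the maximality of the Gaussian differential entropy, comparing the closed-form values for a Gaussian and a uniform RV with the same (arbitrary) variance:

```python
import math

sigma2 = 2.0                                   # arbitrary common variance

# Gaussian RV with variance sigma2: h = 1/2 * log2(2*pi*e*sigma2).
h_gauss = 0.5 * math.log2(2 * math.pi * math.e * sigma2)

# Uniform RV on an interval of width b with the same variance: b^2/12 = sigma2,
# so b = sqrt(12*sigma2), and its differential entropy is log2(b).
b = math.sqrt(12 * sigma2)
h_unif = math.log2(b)

print(f"h(Gaussian) = {h_gauss:.3f} bits, h(Uniform) = {h_unif:.3f} bits")
# The Gaussian value is the larger one, as stated above.
```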
Mutual information of two continuous RRVV $X$ and $Y$:
$$I(X;Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \cdot \log_2 \frac{f_X(x|y)}{f_X(x)} \, dx \, dy$$
$f_{X,Y}(x,y)$ is the joint pdf of $X$ and $Y$, and $f_X(x|y)$ is the conditional pdf of $X$ given $Y = y$.
Properties:
◮ Symmetry: $I(X;Y) = I(Y;X)$.
◮ Non-negative: $I(X;Y) \ge 0$.
◮ $I(X;Y) = h(X) - h(X|Y)$.
◮ $I(X;Y) = h(Y) - h(Y|X)$.
$$h(X|Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y) \cdot \log_2 \frac{1}{f_X(x|y)} \, dx \, dy$$
This is the conditional differential entropy of $X$ given $Y$.
Consider now the continuous-amplitude counterpart of the discrete memoryless channel, described by:
◮ $x(t)$ is a stochastic stationary process, with $m_x = 0$ and bandwidth $W_x = B$ Hz.
◮ This process is sampled with sampling period $T_s = \frac{1}{2B}$, and $X_k = x(k \cdot T_s)$ are thus continuous RRVV $\forall k$, with $E[X_k] = 0$.
◮ A RV $X_k$ is transmitted every $T_s$ seconds over a noisy channel with bandwidth $B$, during a total of $T$ seconds ($n = 2BT$ total samples).
◮ The channel is AWGN, adding noise samples described by RRVV $N_k$ with $m_n = 0$ and $S_n(f) = \frac{N_0}{2}$, so that $\sigma^2_n = N_0 B$.
◮ The received samples are statistically independent RRVV, described as $Y_k = X_k + N_k$.
◮ The cost function for any maximization of the mutual information is the signal power $E[X_k^2] = S$.
The capacity of this channel is defined as
$$C = \max_{f_{X_k}(x)} \left\{ I(X_k; Y_k) : E[X_k^2] = S \right\}$$
◮ $I(X_k; Y_k) = h(Y_k) - h(Y_k|X_k) = h(Y_k) - h(N_k)$
◮ The maximum is only reached if $h(Y_k)$ is maximized.
◮ This only happens if $f_{X_k}(x)$ is Gaussian!
◮ Therefore, $C = I(X_k; Y_k)$ with $X_k$ Gaussian and $E[X_k^2] = S$.
$E[Y_k^2] = S + \sigma^2_n$, then $h(Y_k) = \frac{1}{2} \log_2\left(2 \pi e (S + \sigma^2_n)\right)$; $\quad h(N_k) = \frac{1}{2} \log_2\left(2 \pi e \, \sigma^2_n\right)$.
$$C = I(X_k; Y_k) = h(Y_k) - h(N_k) = \frac{1}{2} \log_2\left(1 + \frac{S}{\sigma^2_n}\right) \text{ bits/“channel use”}.$$
Finally, $C \text{ (bits/s)} = \frac{n}{T} \cdot C \text{ (bits/“channel use”)}$:
$$C = B \cdot \log_2\left(1 + \frac{S}{N_0 B}\right) \text{ bits/s}; \quad \text{note that } \frac{n}{T} = \frac{1}{T_s} = 2B \text{ “channel uses”/s}.$$
The Shannon-Hartley theorem states that the capacity of a bandlimited AWGN channel with bandwidth $B$ and noise power spectral density $N_0/2$ is
$$C = B \cdot \log_2\left(1 + \frac{S}{N_0 B}\right) \text{ bits/s}$$
This is the highest possible information transmission rate over this analog communication channel, accomplished with arbitrarily small error probability.
Capacity increases (almost) linearly with $B$, whereas $S$ determines only a logarithmic increase.
◮ Increasing the available bandwidth has a far larger impact on capacity than increasing the transmission power.
The bandlimited, power-constrained AWGN channel is a very convenient model for real-world communications.
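A minimal numerical sketch (not part of the original slides) of the Shannon-Hartley formula; the power, noise density and bandwidth values are hypothetical, and the bandwidth-versus-power comparison is done at high SNR, where the advantage of extra bandwidth stated above is most visible.

```python
import math

def awgn_capacity(S, N0, B):
    """Shannon-Hartley capacity in bits/s: C = B * log2(1 + S / (N0 * B))."""
    return B * math.log2(1 + S / (N0 * B))

# Hypothetical operating point: S = 1 mW, N0 = 1e-12 W/Hz, B = 1 MHz -> SNR = 30 dB.
S, N0, B = 1e-3, 1e-12, 1e6
print("baseline:        %.2f Mbit/s" % (awgn_capacity(S, N0, B) / 1e6))
print("doubled B:       %.2f Mbit/s" % (awgn_capacity(S, N0, 2 * B) / 1e6))
print("doubled power S: %.2f Mbit/s" % (awgn_capacity(2 * S, N0, B) / 1e6))
# At this (high) SNR, doubling the bandwidth gains much more capacity than doubling S.
```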
Consider an ideal system where $R_b = C$. Then $S = E_b \cdot C$, where $E_b$ is the average bit energy, and
$$\frac{C}{B} = \log_2\left(1 + \frac{E_b}{N_0} \cdot \frac{C}{B}\right) \quad\rightarrow\quad \frac{E_b}{N_0} = \frac{2^{C/B} - 1}{C/B}$$
If we represent the spectral efficiency $\eta = \frac{R_b}{B}$ as a function of $\frac{E_b}{N_0}$, the previous expression is an asymptotic curve on such a plane that marks the border between the reliable zone and the unreliable zone.
◮ This curve helps us to identify the parameter set for a communication system so that it may achieve reliable transmission with a given quality (measured in terms of a limited, maximum error rate).
When $B \to \infty$, $\left.\frac{E_b}{N_0}\right|_{\infty} = \ln(2) = -1.6$ dB.
◮ This limit is known as the “Shannon limit” for the AWGN channel.
◮ The capacity in the limit is $C_{\infty} = \frac{S}{N_0} \log_2(e)$.
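The border curve and the Shannon limit can be tabulated directly, as in this sketch (not from the slides); the spectral efficiency values are arbitrary.

```python
import math

def ebno_limit_db(eta):
    """Minimum Eb/N0 (dB) for reliable transmission at spectral efficiency eta = C/B."""
    ebno = (2 ** eta - 1) / eta
    return 10 * math.log10(ebno)

for eta in [0.01, 0.5, 1, 2, 4, 8]:
    print(f"eta = {eta:5.2f} bit/s/Hz  ->  Eb/N0 >= {ebno_limit_db(eta):6.2f} dB")

# As eta -> 0 (B -> infinity) the bound tends to ln(2), i.e. about -1.59 dB.
print("Shannon limit:", round(10 * math.log10(math.log(2)), 2), "dB")
```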
Figure: working regions as determined by the Shannon-Hartley theorem (source: www.gaussianwaves.com). No channel coding applied.
The diagram illustrates possible tradeoffs, involving $\frac{E_b}{N_0}$, $\frac{R_b}{B}$ and $P_b$.
Note that the spectral efficiency of the modulations is given for optimal pulse shaping.
The lower, left-hand side of the plot is the so-called power-limited region.
◮ There, the $\frac{E_b}{N_0}$ is very poor and we have to sacrifice spectral efficiency to get a given transmission quality ($P_b$).
◮ An example of this is deep-space communications, where the received SNR is extremely low due to the huge free-space losses in the link. The only way to get a reliable transmission is to drop the data rate to very low values.
The upper, right-hand side of the plot is the so-called bandwidth-limited region.
◮ There, the desired spectral efficiency $\frac{R_b}{B}$ for fixed $B$ (desired data rate) is traded off against unconstrained transmission power (unconstrained $\frac{E_b}{N_0}$), under a given $P_b$.
◮ An example of this would be a terrestrial DVB transmitting station, where $\frac{R_b}{B}$ is fixed (standardized), and where the transmitting power is only limited by regulatory or technological constraints.
So far we have considered, on the one hand, discrete input/output memoryless channels, and, on the other hand, purely Gaussian channels, without paying attention to the input source.
The normal mode of operation in digital communications is characterized by a discrete source (digital modulation, with symbols $\mathbf{s}_i$), that goes through an AWGN channel (which adds noise samples $\mathbf{n}$).
Therefore, it makes sense to consider real digital communication processes as instances of Discrete-input Continuous-output Memoryless Channels (DCMC).
It may be demonstrated that, for an M-ary digital modulation going through an AWGN channel, with outputs $\mathbf{r} = \mathbf{s}_i + \mathbf{n}$, the channel capacity is
$$C = \log_2(M) - T\!\left(\frac{E_b}{N_0}\right) \text{ bits per “channel use”}$$
where $T(\cdot)$ is a function that has to be evaluated numerically.
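The slides do not specify $T(\cdot)$; one common way to evaluate the DCMC capacity numerically is Monte Carlo averaging, sketched below (not from the slides) for BPSK with equiprobable symbols over a real AWGN channel. The $E_s/N_0$ grid is arbitrary, and for uncoded BPSK $E_s = E_b$.

```python
import numpy as np

def bpsk_dcmc_capacity(esn0_db, n_samples=200_000, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of the BPSK soft-decision (DCMC) capacity, in bits per
    channel use, for equiprobable symbols +1/-1 over a real AWGN channel."""
    esn0 = 10 ** (esn0_db / 10)
    sigma2 = 1 / (2 * esn0)                        # noise variance for Es = 1
    x = rng.choice([-1.0, 1.0], size=n_samples)    # equiprobable BPSK symbols
    y = x + rng.normal(0, np.sqrt(sigma2), size=n_samples)
    # Posterior P(x=+1 | y) from the Gaussian likelihoods, then C = 1 - E[Hb(posterior)].
    p_plus = 1 / (1 + np.exp(-2 * y / sigma2))
    hb = -(p_plus * np.log2(p_plus + 1e-300) + (1 - p_plus) * np.log2(1 - p_plus + 1e-300))
    return 1 - hb.mean()

for esn0_db in [-2, 0, 2, 4, 6]:
    print(f"Es/N0 = {esn0_db:3d} dB  ->  C ≈ {bpsk_dcmc_capacity(esn0_db):.3f} bit/use")
```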
Figure 10: Capacity of BPSK DCMC against hard-decided BPSK (spectral efficiency in bps/Hz versus $E_b/N_0$ in dB; curves: DCMC BPSK capacity, hard-decided (DMC) BPSK capacity, AWGN capacity).
For low $E_b/N_0$, there is higher capacity available for the DCMC.
As $E_b/N_0$ grows, the DCMC and DMC capacities converge.
There is around 1-2 dB to be gained with the DCMC in the $E_b/N_0$ range of interest.
Guaranteeing a given target $P_b$ further limits the attainable zone in the spectral efficiency/signal-to-noise ratio plane, depending on the framework chosen.
For fixed spectral efficiency (fixed $\frac{R_b}{B}$), we move along a horizontal line where we manage the $P_b$ versus $\frac{E_b}{N_0}$ tradeoff.
For fixed signal-to-noise ratio (fixed $\frac{E_b}{N_0}$), we move along a vertical line where we manage the $P_b$ versus $\frac{R_b}{B}$ tradeoff.
Figure 11: Working regions for given transmission schemes. Capacities are for DCMC (source: www.comtechefdata.com).
Hard-decision maximum likelihood (ML) demodulation:
◮ Apply the decision region criterion.
◮ Decided symbols are converted back into mapped bits.
◮ The hard demodulator thus outputs a sequence of decided bits.
Hard-decision ML demodulation fails to capture the extra information contained in the received constellation.
◮ The distance from the received symbol to the hard-decided symbol may provide a measure of the uncertainty and reliability of the decision.
Figure 12: 16-QAM received constellation with noise ($p(y_k|x_j)$).
How could we take advantage of the available extra information?
We can use basic information about the channel and about the transmitted modulation to build new, significant demodulation values.
Consider a standard modulation and an AWGN channel $\mathbf{r} = \mathbf{s}_i + \mathbf{n}$, where $\mathbf{s}_i$ is the symbol transmitted in the signal space and $\mathbf{n}$ is the AWGN sample corresponding to the vector channel.
Test hypothesis for demodulation:
$$P(b_k = b \,|\, \mathbf{r}) = \frac{f_{\mathbf{r}}(\mathbf{r}|b_k = b)\, P(b_k = b)}{f_{\mathbf{r}}(\mathbf{r})} = \frac{P(b_k = b)}{f_{\mathbf{r}}(\mathbf{r})} \sum_{\mathbf{s}_j / b_k = b} f_{\mathbf{r}}(\mathbf{r}|\mathbf{s}_j) = \frac{P(b_k = b)}{f_{\mathbf{r}}(\mathbf{r})} \sum_{\mathbf{s}_j / b_k = b} \frac{1}{\sqrt{2\pi\sigma^2_n}}\, e^{-|\mathbf{s}_j - \mathbf{r}|^2 / 2\sigma^2_n}$$
Note that $f_{\mathbf{r}}(\mathbf{r})$ does not depend on the hypothesis $b_k = b$, $b = 0, 1$.
If the input bits are equiprobable, we may resort to the quotient
$$\lambda_k = \log \frac{P(b_k = 1|\mathbf{r})}{P(b_k = 0|\mathbf{r})} = \log \frac{f_{\mathbf{r}}(\mathbf{r}|b_k = 1)}{f_{\mathbf{r}}(\mathbf{r}|b_k = 0)} = \log \frac{\sum_{\mathbf{s}_j / b_k = 1} e^{-|\mathbf{s}_j - \mathbf{r}|^2 / 2\sigma^2_n}}{\sum_{\mathbf{s}_j / b_k = 0} e^{-|\mathbf{s}_j - \mathbf{r}|^2 / 2\sigma^2_n}}$$
Note that $f_{\mathbf{r}}(\mathbf{r}|b_k = b)$ is a likelihood function. $\lambda_k$ is called the Log-Likelihood Ratio (LLR) on bit $b_k$.
◮ Its sign could be used in hard decision to decide whether $b_k$ is more likely to be a 1 ($\lambda_k > 0$) or a 0 ($\lambda_k < 0$).
◮ The modulus of $\lambda_k$ could be used as a measure of the reliability of the hypothesis on $b_k$.
◮ A larger $|\lambda_k|$ means a larger reliability of a possible decision.
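A small Python sketch (not from the slides) of the LLR expression above, for a hypothetical Gray-mapped 4-PAM constellation with made-up noise variance and received sample:

```python
import numpy as np

# Hypothetical Gray-mapped 4-PAM constellation: symbol amplitude -> bits (b1, b0).
constellation = {-3: (0, 0), -1: (0, 1), +1: (1, 1), +3: (1, 0)}
sigma2_n = 0.5                                   # assumed noise variance
r = 0.4                                          # received (noisy) sample

def llr(bit_index):
    """LLR of bit b_k given r: log of the ratio of summed Gaussian likelihoods."""
    num = sum(np.exp(-abs(s - r) ** 2 / (2 * sigma2_n))
              for s, bits in constellation.items() if bits[bit_index] == 1)
    den = sum(np.exp(-abs(s - r) ** 2 / (2 * sigma2_n))
              for s, bits in constellation.items() if bits[bit_index] == 0)
    return float(np.log(num / den))

for k in (1, 0):
    l = llr(k)
    print(f"bit b{k}: LLR = {l:+.2f} -> hard decision {1 if l > 0 else 0}, reliability {abs(l):.2f}")
```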
Therefore, basic knowledge about the transmission system and the channel may lead to the calculation of probabilistic (soft) values related to a possible decision, under suitable hypotheses.
The calculated LLR values for the received bit stream may provide valuable insight and information for the next subsystem in the receiver chain: the channel decoder.
If the input bits are not equiprobable, we may still use a soft-value characterization, in the form of a quotient of a posteriori demodulation values
$$\log \frac{P(b_k = 1|\mathbf{r})}{P(b_k = 0|\mathbf{r})} = \log \frac{P(b_k = 1)}{P(b_k = 0)} + \log \frac{f_{\mathbf{r}}(\mathbf{r}|b_k = 1)}{f_{\mathbf{r}}(\mathbf{r}|b_k = 0)}$$
where the LLR is updated with the a priori probabilities $P(b_k = b)$.
We are going to see in the next lesson how soft demodulated values play a key role in the most powerful channel encoding/decoding methods.
Information Theory has found applications in communications, artificial intelligence, data mining, machine learning, robotics...
We have seen three fundamental results from Shannon's 1948 seminal work, which constitute the foundations of all modern communications.
◮ Source coding theorem, which states the limits and possibilities of lossless data compression.
◮ Noisy-channel coding theorem, which states the need for channel coding techniques to achieve a given performance using constrained resources. It establishes the asymptotic possibility of error-free transmission over discrete-input discrete-output noisy channels.
◮ Shannon-Hartley theorem, which establishes the absolute (asymptotic) limits for error-free transmission over AWGN channels, and describes the different tradeoffs involved among the given resources.
These results set the framework for practical and feasible communication systems, managing and trading off constrained resources under a given target performance (BER).
◮ The $\eta = \frac{R_b}{B}$ against $\frac{E_b}{N_0}$ plane (under a target BER) constitutes the playground for designing and bringing into practice any communication standard.
◮ Any movement over the plane has a direct impact on business and revenues in the telco domain.
When addressing practical designs in communications, these results and limits are not much heeded explicitly, but they underlie all of them.
◮ There are lots of common-practice and common-wisdom rules of thumb in the domain, stating what to use when (regarding modulations, channel encoders and so on).
◮ Nevertheless, optimizing the designs so as to profit as much as possible from all the resources at hand requires making these limits explicit.
[1] Coding (course). Stanford University, 2010. [Online]. Available: http://web.stanford.edu/group/cioffi/book
[2] T. M. Cover and J. A. Thomas, Elements of Information Theory. New Jersey: John Wiley & Sons, Inc., 2006.
[3] D. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003. [Online]. Available: http://www.inference.phy.cam.ac.uk/mackay/itila/book.html
[4] S. Haykin, Communications Systems. New York: John Wiley & Sons, Inc., 2001.
[5] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, 1948. [Online]. Available: http://plan9.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf