1.1. Shannon's entropy

         Entropy is a term borrowed from physics (thermodynamics), where it denotes a measure of the amount of disorder in a system. In information theory, entropy is a mapping from a probability distribution (or a random variable) to the real numbers. It measures the average amount of information gained when an outcome of the random variable is observed.
         We intuitively know what information is, and we also feel that it is an elusive, non-mathematical quantity that cannot be precisely defined and measured. Intuitively, information is a layer between syntactic representation (data or messages) and semantics (the knowledge obtained when the data is interpreted).
         But from the engineering point of view, information must NOT be confused with meaning: "The semantic aspects of communications are irrelevant to the engineering aspects" [1]. Information is a measure of one's freedom of choice and is measured by the logarithm of the number of choices.
         The importance of information theory is that it quantifies information. The first intuitive observation is that the information content of a message is equivalent to the amount of surprise in the message.
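         As a concrete illustration (assuming, as is customary, that the logarithm is taken in base 2, so that information is measured in bits): a message chosen from $N = 8$ equally likely alternatives carries $\log_2 8 = 3$ bits of information, i.e., exactly the three YES/NO answers needed to single it out, whereas a message that is the only possible one carries $\log_2 1 = 0$ bits and brings no surprise at all.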
         Let us start with a simple example. Consider a binary decision tree in which every node, except the leaves, is a YES or NO question. The leaves contain the possible results (or knowledge). Encoding NO with 0 and YES with 1, a possible result can be represented as the bit string of answers to the questions that lead to that result. Consider now two persons, P1 and P2: P1 asks the questions and P2 gives the answers. The more questions P2 answers, the more information P1 gets. The average number of questions to answer in order to reach a leaf is thus the average amount of information P1 can obtain. If the tree is balanced, the average amount of information is proportional to $\log_2 N$, where N is the number of leaves.
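         The correspondence between answers and bit strings can be made explicit with a short sketch. The code below is illustrative and not part of the original text; it assumes a perfectly balanced tree of depth 3, i.e., N = 8 leaves.

# Minimal sketch: in a balanced binary decision tree with N = 2**d leaves,
# every leaf is reached after d = log2(N) YES/NO questions, and the answers
# (NO = 0, YES = 1) form the bit string that identifies the leaf.
import math

depth = 3                                       # assumed depth of the balanced tree
leaves = [format(i, '0{}b'.format(depth)) for i in range(2 ** depth)]

N = len(leaves)                                 # number of leaves (here 8)
print(leaves)                                   # ['000', '001', ..., '111']
print(depth == math.log2(N))                    # True: questions per leaf = log2(N)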


Fig. 1. A decision tree


         Consider the decision tree in Fig. 1, and try to estimate the average number of questions to answer in order to reach a leaf starting from the root.
  • the average number of questions to reach a leaf from D is 1;
  • the average number of questions to reach a leaf from B is 1 + the average number of questions to reach a leaf from the successors of B;
  • the average number of questions to reach a leaf from C is 1;
  • the average number of questions to reach a leaf from A is 1 + the average number of questions to reach a leaf from the successors of A (a recursive computation along these lines is sketched below).
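
         The recursion above is easy to turn into code. The sketch below is illustrative only: the exact shape of the tree in Fig. 1 is not reproduced here, so the dictionary `tree` is an assumed layout that is merely consistent with the description (A has successors B and C, B has the non-leaf successor D, and C and D lead directly to leaves); each answer is assumed equally likely.

# Average number of YES/NO questions needed to reach a leaf from a given node.
tree = {                      # assumed shape, consistent with the bullets above
    'A': ['B', 'C'],
    'B': ['D', 'leaf1'],
    'C': ['leaf2', 'leaf3'],
    'D': ['leaf4', 'leaf5'],
}

def avg_questions(node):
    if node not in tree:      # a leaf: no further questions are needed
        return 0.0
    children = tree[node]
    # one question asked at this node + the average over its successors,
    # each successor taken with equal weight (equally likely answers)
    return 1.0 + sum(avg_questions(c) for c in children) / len(children)

print(avg_questions('D'))     # 1.0
print(avg_questions('C'))     # 1.0
print(avg_questions('B'))     # 1.5  = 1 + (1.0 + 0.0) / 2
print(avg_questions('A'))     # 2.25 = 1 + (1.5 + 1.0) / 2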

         But what happens if some answers are more probable than others? Obviously, if P2 gives a more probable answer, the gain of information for P1 is smaller. This situation is discussed next.
         Consider N events whose probabilities of occurrence are $p_1, p_2, \ldots, p_N$. Can we measure how much choice is involved, or what the gain of information is when one of these events happens? This measure is the entropy, denoted by $H(p_1, p_2, \ldots, p_N)$. It is reasonable to require the following properties of H [1]:
  1. H should be continuous in the $p_i$: a small change in the value of $p_i$, for an arbitrary $i$, should cause only a small change in the value of H;
  2. if $p_i = \frac{1}{N}$ for all $i$, then H should be a monotonically increasing function of N (with equally likely events there is more choice, or uncertainty, when there are more possible events);

Fig. 2. Decomposition of choice
  3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H; in Fig. 2 we have $H(\frac{1}{2}, \frac{1}{3}, \frac{1}{6}) = H(\frac{1}{2}, \frac{1}{2}) + \frac{1}{2} H(\frac{2}{3}, \frac{1}{3})$, where the coefficient $\frac{1}{2}$ appears because the second choice occurs only half the time.

Shannon proved that the only H satisfying the three assumptions above is of the form:
$H = -K \sum_{i=1}^{N} p_i \log p_i$,
where K is a positive constant.
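
         The formula is straightforward to evaluate numerically. The sketch below is illustrative and assumes $K = 1$ and a base-2 logarithm, so that H is expressed in bits; the last lines check property 3 on the decomposition example assumed above for Fig. 2.

import math

def entropy(probs, K=1.0):
    # H = -K * sum_i p_i * log2(p_i); terms with p_i = 0 contribute nothing
    return -K * sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # 1.0 bit: one fair YES/NO question
print(entropy([1/8] * 8))                  # 3.0 bits = log2(8), the balanced tree above

# Decomposition of choice (property 3): H(1/2,1/3,1/6) = H(1/2,1/2) + 1/2 * H(2/3,1/3)
lhs = entropy([1/2, 1/3, 1/6])
rhs = entropy([1/2, 1/2]) + 0.5 * entropy([2/3, 1/3])
print(abs(lhs - rhs) < 1e-12)              # True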