Encoding

1. Encoding

1.1. Definitions

Bits are the individual zeros and ones that are stored by computers. A bit can have one of two states usually described as one of the following pairs: zero or one; on or off; high or low; or open or closed. A byte is a group of eight bits. A pattern of bits is a list of ones and zeros. Encoding is the translation of a character into a pattern of bits. More generally, encoding is the translation of a symbol or a sequence of characters into a pattern of bits.

The following table shows SI prefixes that are commonly used when describing a quantity of bits or bytes. Just as one may need to specify whether a number such as 101 is written in base 2 or base 10, one must specify which system of units is being used with a prefix to remove ambiguity. The reason is that in the SI unit system, 1 MB means 10^6 = 1,000,000 bytes, whereas 1 MB in computing could mean 2^20 = 1,048,576 bytes. In this book, we always use the SI prefixes. However, one must always be aware of this ambiguity. (To remove the ambiguity, use the IEC prefixes, which are the same as the SI prefixes except that they have an "i" at the end. For example, 1 MiB = 2^20 bytes.)

Factor   Prefix   Symbol   Value
10^18    exa      E        1,000,000,000,000,000,000
10^15    peta     P        1,000,000,000,000,000
10^12    tera     T        1,000,000,000,000
10^9     giga     G        1,000,000,000
10^6     mega     M        1,000,000
10^3     kilo     k        1,000
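The gap between the two meanings of "mega" can be checked with a few lines of Python. This is only an illustrative sketch; the dictionary names and the helper prefix_gap are made up for this example and are not part of any standard library.

```python
# SI (decimal) and IEC (binary) prefixes expressed as byte counts.
# The names SI, IEC, and prefix_gap are illustrative assumptions.
SI = {"kB": 10**3, "MB": 10**6, "GB": 10**9}
IEC = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def prefix_gap(si_unit, iec_unit):
    """Return how much larger the IEC unit is than the SI unit, as a fraction."""
    return IEC[iec_unit] / SI[si_unit] - 1


print(SI["MB"])                                  # 1000000
print(IEC["MiB"])                                # 1048576
print(round(prefix_gap("MB", "MiB") * 100, 2))   # 4.86 (percent)
```

The difference is under 5% for mega, which is why the ambiguity is easy to overlook.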

1.2. Methods of Storing Bits

There are many methods for storing a list of bits. Examples include placing holes and bumps on a piece of paper, charging a set of objects negative or positive, aligning magnets in the north or south direction, or creating pits in a piece of plastic or metal.

The following is a primitive method of storing bits.

A simple memory stick consists of small round magnets attached to a piece of wood. If the person holding the large magnet sees the stick move toward him, he records a 1. If it moves away from him, he records a 0.

A more advanced method of storing bits is to use a CD or DVD. A CD or a DVD has small pits in it. In locations where there is a pit, the sensor stops receiving a reflected signal and this is interpreted as 0. In locations where there is not a pit (a "land"), there is a reflected signal and this is interpreted as 1. What factors do you think control how many bits can be stored per unit area? What factors do you think control how quickly the bits can be read/written?


1.3. Encoding Motivation

You are a "forensic computer scientist" and are given a DVD. You inspect it and find that it contains a list of 1s and 0s. How do you translate the list of bits into something useful?

001010010100101001010101
010101010101010100101000
010111101010101010101111
111110111111101111111111
001010010100101001010101
010101010101010100101000
010111101010101010101111
111110111111101111111111

1.4. Encoding Table

Encoding requires the use of an encoding table: a table that associates a bit pattern with a character (or sequence of characters). Such an encoding table was used previously in the Binary Representation of Numbers. As shown in Table 1, we associated each bit pattern of length two with a decimal integer.

Encoding Table 1

bit pattern   characters
00            0
01            1
10            2
11            3

In this case the bit pattern 1110 is associated with the sequence of characters 32.

Bit patterns can be associated with more than just decimal integers. In Table 2, the four possible bit patterns of length two are associated with a color.

Encoding Table 2

bit pattern   characters
00            red
01            green
10            blue
11            black

In this case the bit pattern 1110 is associated with the sequence of characters blackblue.

If Encoding Table 2 were used to write a message on a primitive memory stick and the pattern were 00110011, the decoded message associated with this bit pattern would be redblackredblack.

Encoding Table 3

bit pattern   characters
00            zero
01            one
10            two
11            three

If you and I agree to use Encoding Table 3 and I hand you a primitive memory stick with the pattern 00110011, you would decode the bit pattern to mean zerothreezerothree.
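Decoding with an encoding table amounts to cutting the bit pattern into fixed-length pieces and looking each piece up. The following is a minimal sketch of that idea in Python; the names TABLE_2, TABLE_3, and decode are chosen for this example only.

```python
# Encoding Tables 2 and 3 from the text, written as Python dictionaries.
TABLE_2 = {"00": "red", "01": "green", "10": "blue", "11": "black"}
TABLE_3 = {"00": "zero", "01": "one", "10": "two", "11": "three"}


def decode(bits, table, width=2):
    """Split bits into chunks of `width` bits and look up each chunk in the table."""
    if len(bits) % width != 0:
        raise ValueError("bit pattern length must be a multiple of the chunk width")
    chunks = [bits[i:i + width] for i in range(0, len(bits), width)]
    return "".join(table[chunk] for chunk in chunks)


print(decode("1110", TABLE_2))       # blackblue
print(decode("00110011", TABLE_2))   # redblackredblack
print(decode("00110011", TABLE_3))   # zerothreezerothree
```

The same bit pattern decodes to different messages depending on which table is used, so the writer and the reader must agree on the table in advance.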

1.5. 7-bit ASCII Encoding Table

There is a special encoding table called the 7-bit ASCII table; an excerpt is given below.

bit pattern   character
1100001       a

If a forensic computer scientist is analyzing a hard drive and knows that your computer encodes information using the 7-bit ASCII Table, and sees the bit pattern 1100001, he will know that you wrote the letter a.

Unlike Encoding Tables 2 and 3, where a long sequence of characters was associated with a short bit pattern, the 7-bit ASCII table associates a single character with a long bit pattern (7 bits). The reason is that with 7 bits there are enough unique bit patterns (2^7 = 128) to have a unique bit pattern associated with each of the common characters used in the English language.
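The count of 128 can be confirmed by listing every possible pattern. The sketch below uses itertools.product from the Python standard library; the function name bit_patterns is an assumption made for this example.

```python
from itertools import product


def bit_patterns(n):
    """Return every bit pattern of length n as a string of 0s and 1s."""
    return ["".join(p) for p in product("01", repeat=n)]


print(bit_patterns(2))        # ['00', '01', '10', '11']
print(len(bit_patterns(2)))   # 4
print(len(bit_patterns(7)))   # 128
```

Each additional bit doubles the number of available patterns.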

Instead of listing character and bit pattern associations, encoding tables often list only character and decimal value associations. To determine the bit pattern, the decimal number must be converted to binary.

bit pattern   character   decimal value
1100001       a           97

If a forensic computer scientist tells you that he read "ASCII value 97" on a hard drive, he means that he read the bit pattern 1100001. This is sometimes more convenient than saying "I read the binary value 1100001" on a hard drive.
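The relationship between a character, its decimal value, and its 7-bit pattern can be explored with Python's built-in ord, format, and int functions. This is a small sketch; the helper name to_bit_pattern is made up for this example.

```python
def to_bit_pattern(ch):
    """Return the 7-bit ASCII pattern for a single character."""
    code = ord(ch)  # decimal value of the character
    if code > 127:
        raise ValueError("character is outside the 7-bit ASCII range")
    return format(code, "07b")  # binary, padded with zeros to 7 digits


print(ord("a"))              # 97
print(to_bit_pattern("a"))   # 1100001
print(int("1100001", 2))     # 97
```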

1.6. 7-bit ASCII Encoding

The following is part of the 7-bit ASCII decimal encoding table. (The table is also referred to as the 7-bit ASCII character set.) If I speak ASCII Decimal and say 72 73, you would know that I meant HI after looking at the table. If I speak 7-bit ASCII Binary and say 1001000 1001001, you would need to convert the binary numbers to their decimal representations of 72 73 before using the table.
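The two ways of "speaking" ASCII described above can be sketched in Python, where the built-in chr function plays the role of the table lookup. The function names below are hypothetical and exist only for this example.

```python
def decode_ascii_decimal(message):
    """Decode space-separated decimal values, e.g. '72 73'."""
    return "".join(chr(int(value)) for value in message.split())


def decode_ascii_binary(message):
    """Decode space-separated 7-bit patterns, e.g. '1001000 1001001'."""
    # Each pattern is first converted from binary to its decimal value.
    return "".join(chr(int(pattern, 2)) for pattern in message.split())


print(decode_ascii_decimal("72 73"))             # HI
print(decode_ascii_binary("1001000 1001001"))    # HI
```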

7-bit ASCII Decimal Encoding Table

2.27. Estimate

The 1984 science fiction novel Neuromancer by William Gibson contains 271 pages of text. Each page contains, on average, approximately 400 words. Each word is, on average, five ASCII characters long. Knowing that each ASCII character requires 7 bits of computer memory storage, how many bytes of computer memory storage are required to store all the words from Gibson's Neuromancer novel?

A. 271,000 bytes
B. 34,688,000 bytes
C. 4,336,000 bytes
D. 542,000 bytes

Personal computers in 2010 can come equipped with hard disk drives having 1 terabyte of storage capacity (a terabyte is one trillion bytes, or 1,000,000,000,000 bytes). Approximately how many copies of William Gibson's Neuromancer novel could you store on your 1 terabyte hard disk drive, assuming the entire disk is available for storage?

To find estimates for these problems and the following ones, read the following information on estimates.

2.28. Estimate

A page from a book contains 500 words, and each word contains on average 4 characters. Considering only the characters on the page, how many bits of information does the page contain? (Assume that the characters are encoded using bit patterns of length 7.)

A. 500 bits
B. 2,000 bits
C. 8,000 bits
D. 6,000 bits
E. None of the above

2.29. Estimate

If a single ASCII character (from the extended set) can be represented by 7 bits, and we have a 500 gigabyte hard drive available for storage, about how many ASCII characters can be stored on this hard drive? (NOTE: One gigabyte is equal to about one billion bytes)

A. about 500 million ASCII characters
B. about 8 billion ASCII characters
C. about 64 billion ASCII characters
D. about 500 billion ASCII characters
E. None of the above

2.30. Estimate

A U.S. one dollar bill measures about 6 centimeters wide by 11 centimeters long. If there are 330 ASCII characters printed on one side of it, then what is the approximate data density (in bytes per cm^2) of one side of a U.S. one dollar bill? (Assume each ASCII character is encoded with a 7-bit pattern.)

A. about 4.4 bytes per square centimeter
B. about 5.5 bytes per square centimeter
C. about 6.6 bytes per square centimeter
D. about 7.7 bytes per square centimeter
E. None of the above

2.31. Data Density

On an 8.5 inch x 11 inch piece of paper, you can write about 500 letters and numbers ("characters"). Suppose that each character is encoded as an 8-bit binary number.

1. What is the data density of the information stored on the piece of paper in bytes per square inch?
2. If this information were stored on a hard drive, how many ones and zeros would be needed?

2.32. Data Density

How many 7-bit ASCII characters could you store on a hard drive? (Choose a reasonable value for the storage capacity for a hard drive on a laptop or desktop computer purchased in the last five years.)

3. Activities

3.1. Prefixes

Why do you think a number like 2^20 is used in computing to describe a MB instead of 10^6?

Guess the value of x in the statement: 1 TiB = 2^x.

An encoding table is a table that associates a bit pattern with a character or object. For example, an ASCII table associates the bit pattern 1001000 with the character H.

Create an encoding table with bit patterns each with three bits. Write a binary encoded message. Hand the message to your partner along with the decoding table and see if they can determine what your binary encoded message means.

3.3. Discussion Question

How would a forensic computer scientist figure out that the yellow numbers correspond to an Excel Spreadsheet document?

001010010100101001010101
010101010101010100101000
010111101010101010101111
111110111111101111111111
001010010100101001010101
010101010101010100101000
010111101010101010101111
111110111111101111111111

3.4. Discussion Question

Explain in basic terms the meaning of the following: "The old monitor only supports 8-bit color, my monitor supports 24-bit color." (Related link: 8-bit art: [1])

3.5. Other Encoding

The 7-bit ASCII character set only has 128 possible characters. How would you encode the Greek letter beta in binary on a computer? That is, suppose you were asked to come up with a scheme for encoding the Greek letter beta in binary so that anyone who saw the particular list of zeros and ones would immediately know you meant the Greek letter beta.

3.6. Experiment

ASCII is one of many ways to encode numbers and characters as binary patterns.

Question: Compare this search: CCCP with this search: CCCP. Why are the results different? The text in the search box looks the same!

Experiment:

• Click the first link, copy the text CCCP from the search box, paste it into Notepad (or TextEdit on a Mac), and save. What is the size of the saved file?
• Do the same for the second link. What happens when you choose ANSI as the encoding versus UTF-8 when saving? What is the size of the file when you chose ANSI? UTF-8? When you choose ANSI, exit and re-open the file you saved. What do you see?

3.7. Estimates

Note: In the problems that request an estimate, your answer can be in a fairly wide range of values. Your explanation is more important than your actual number. For example, if I asked you to compare the area of a DVD to the area of a sheet of paper, a correct answer could be "I can lay four DVDs on a sheet of paper and they cover up most of the paper. Therefore the area of a DVD is about one-fourth that of a sheet of paper." Another correct answer could be to compute the area of the sheet of paper and the area of the DVD using the formulas for the area of a rectangle and the area of a circle and then take the ratio of these areas. The first answer is less accurate, but the approach is more in the spirit of an estimate.
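For comparison, the formula-based approach mentioned in the note might look like the sketch below. The dimensions are assumptions made for this example: a DVD diameter of 12 cm, a sheet of paper measuring 8.5 x 11 inches, and the hole in the center of the disc is ignored.

```python
import math

# Assumed dimensions (not given in the note above).
DVD_DIAMETER_CM = 12.0
PAPER_WIDTH_CM = 8.5 * 2.54    # 8.5 inches converted to centimeters
PAPER_HEIGHT_CM = 11 * 2.54    # 11 inches converted to centimeters

dvd_area = math.pi * (DVD_DIAMETER_CM / 2) ** 2   # area of a circle
paper_area = PAPER_WIDTH_CM * PAPER_HEIGHT_CM     # area of a rectangle
ratio = dvd_area / paper_area

print(round(dvd_area, 1))     # 113.1 (square centimeters)
print(round(paper_area, 1))   # 603.2 (square centimeters)
print(round(ratio, 2))        # 0.19
```

Both approaches give a DVD area that is a modest fraction of the sheet's area, which is close enough for an estimate.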

1. Estimate the number of characters (the numbers 0 through 9, letters a-z upper and lowercase) that you could write on a piece of paper by hand using a pen or pencil. Explain how you arrived at your estimate.
2. Estimate the number of bits of information that you could store on a single sheet of notebook paper using only a pen. Explain how you arrived at your estimate.
3. Estimate the number of bytes of information that you could store on a single sheet of notebook paper using only a pen. Explain how you arrived at your estimate.
4. How many sheets of notebook paper would you need to store the same number of characters that can be stored on a DVD? (Assume that the characters on the DVD are only from the 8-bit ASCII character set.)
5. Estimate the density of data stored on your sheet of notebook paper (in bytes per square centimeter).
6. Estimate the density of data stored on a DVD (in bytes per square centimeter).

3.8. Encoding Chinese

Seven bits are required to represent the most commonly used characters of written English. The 7-bit ASCII character set relates 128 binary numbers to 128 commonly used characters in the English language.

For example, the bit pattern 1100001 corresponds to the character a and 1000001 corresponds to the character A.

The Chinese character set is composed of unique characters that taken together comprise the written Chinese language. A college-educated Chinese adult is fluent with 6,000 to 7,000 unique Chinese characters.

How many bits are required to represent the entire set of written Chinese characters for a college-educated Chinese adult?

4. Resources

• Wikipedia page with a table of SI and IEC prefixes: [3].
• Wikipedia page on history of the use of the word "bit": [4].
• Articles about how computer memory works:
• Using paper instead of computer memory:
• An advanced discussion of encoding [5].
• The binary alphabet is superior to the "best" alphabet (Korean) [6].
• Joke: Two bytes walk into a bar. The bartender says "Can I get you anything?" The bytes say "sure, make us a double." (A sequence of sixteen bits is called a double.)