IO

Contents

  1. IO
    1. Overview
    2. Terminology
    3. Motivation
    4. ASCII
    5. Binary
      1. A Simple Encoding
      2. Image Encoding
      3. Scientific Encoding
  2. Problems
    1. Your own Encoding
    2. Simple Spreadsheet Encoding
    3. Scientific File Format
    4. Speed Comparison
    5. Speed Comparison
    6. Binary

1. IO

I/O stands for Input/Output.

1.1. Overview

Nearly every programming language has the capability of reading and writing data. Many languages have special functions or libraries that allow files in various standard formats to be read into memory.

However, as a scientist, you will often encounter files that require you to write your own special function or library in order to use and visualize the data. The basic algorithms for both ASCII and binary data are

  • Read a record
  • Write a record

And for ASCII:

  • Skip lines
  • Test lines

1.2. Terminology

  • I/O - Input/Output, AKA Read/Write, Import/Export.
  • Records - A record usually means a self-contained chunk of information.
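  • Fields - The individual values that make up a record.
  • Delimiters - The characters that separate records, or the fields within a record.
  • Headers - Information that appears before the records in a file. The information is usually intended to help interpret the records.
  • Type - The kind of value stored in a field (for example, integer, floating point, or text).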
  • Cadence - For records that have time stamps, it is the separation in time between records.

For example, an instrument that measures temperature every hour may have records containing the time of the measurement and the temperature, and the file may be

# Temperature measurements in Fahrenheit in Fairfax VA on 2016-01-01
Hour  Temp
00:00 32
01:00 33
02:00 31

In this case, the header consists of the first two lines and the records are separated by newlines. The fields in each record are hour and temperature, separated by spaces, and the cadence is one hour.
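The "skip lines" and "read a record" steps can be sketched in Python as follows. This is a minimal sketch, assuming the file contents shown above have already been read into a string; the names SAMPLE and parse_temperatures are hypothetical and chosen only for illustration.

```python
# Sample file contents, copied from the example above.
SAMPLE = """# Temperature measurements in Fahrenheit in Fairfax VA on 2016-01-01
Hour  Temp
00:00 32
01:00 33
02:00 31
"""

def parse_temperatures(text, n_header=2):
    """Return a list of (hour, temperature) tuples from the file contents."""
    records = []
    lines = text.splitlines()          # records are delimited by newlines
    for line in lines[n_header:]:      # skip lines: the two header lines
        if not line.strip():           # test lines: ignore blank lines
            continue
        hour, temp = line.split()      # fields are delimited by spaces
        records.append((hour, int(temp)))
    return records

records = parse_temperatures(SAMPLE)
print(records)  # [('00:00', 32), ('01:00', 33), ('02:00', 31)]
```

The number of header lines is passed as a parameter because nothing in the file itself says where the header ends.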

1.3. Motivation

A significant amount of scientific data is stored in files. These files may be "ASCII" or binary.

Note:

In what follows everything after # should be entered on the Bash command line; everything after >> should be entered on the MATLAB/Octave command line; # (or >>) should be omitted. If a command does not start with either symbol, the command should be entered on the Bash command line.

1.4. ASCII

When data are stored on disk in an ASCII file, all that is stored is a sequence of 0s and 1s. When you open a file with a program, that program translates these bits into something you recognize (e.g., letters and numbers).

When a character is entered into a text editor and written to a file, it is first translated into a bit pattern. If you create a text file with the letter A and save it as file1.txt, the information written to file1.txt is 41 in hexadecimal (and 0100 0001 in binary; see http://www.ascii-code.com/).

To create this file, enter

echo -n "A" > file1.txt

The command line program od can be used to inspect the bit pattern. (Strictly speaking, od does not output bit patterns, as they take up too much space. Instead, od outputs hexadecimal values that can be converted into bit patterns.)

od -t x1 file1.txt

displays

0000000 41
0000001

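The interpretation of the first row of the output of od is

  1. the first column is the offset of the first byte shown in that row, counted from the start of the file (od prints offsets in octal by default)
  2. the second column is the first byte (8 bits) in the file, represented as hex for brevity

The last row contains only an offset: the position of the end of the file, which equals the file size in bytes.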

According to http://www.ascii-code.com/, the hex value 41 corresponds to the symbol A, as expected.
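The same inspection can be sketched in Python. This is a minimal illustration, not a replacement for od; it assumes a temporary directory is an acceptable place for the file, and the variable names are hypothetical.

```python
import os
import tempfile

# Write the single character "A" to a temporary file, with no newline.
path = os.path.join(tempfile.mkdtemp(), "file1.txt")
with open(path, "w", encoding="ascii") as f:
    f.write("A")

# Read the raw bytes back and show each one as hex and as a bit pattern.
with open(path, "rb") as f:
    data = f.read()

hex_values = [format(b, "02x") for b in data]
bit_patterns = [format(b, "08b") for b in data]
print(hex_values)    # ['41']
print(bit_patterns)  # ['01000001']
```

Opening the file with mode "rb" matters here: it returns the stored bytes without decoding them into characters.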

We can create a file with two letters using

echo -n "AA" > file2.txt

and then inspect the content using

od -t x1 file2.txt

which displays

0000000 41 41
0000002

Notice that both of these files were created using the command echo with the switch -n. The -n switch tells echo to not append a newline. Try the echo command without the switch to see the difference:

echo "A" > file3.txt

and

od -t x1 file3.txt

which displays

0000000 41 0a
0000002

The extra value 0a corresponds to a newline character. This symbol tells any program that is displaying the contents of the file in a human readable form to start a new row before displaying additional information. This newline character (hex value 0a) is one of the "non-printable" characters in the ASCII table (http://www.ascii-code.com/).

Also note that od can be used to display non-printable characters in a more human readable format using the switch -t c instead of -t x1:

od -t c file3.txt

which displays

0000000   A  \n
0000002

A program that reads an ASCII file and displays the results on the screen simply reads 8 bits, looks up the character associated with these 8 bits in the ASCII table, and then displays the character. It then repeats this process with the next 8 bits, etc.
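The read-and-look-up loop described above can be sketched in Python. The function name decode_ascii is hypothetical; the built-in chr plays the role of the ASCII table for codes 0 to 127.

```python
def decode_ascii(data):
    """Decode bytes one at a time using the ASCII table."""
    chars = []
    for byte in data:            # each item is one byte (8 bits) as an int 0-255
        if byte > 127:
            raise ValueError(f"byte {byte:#04x} is not in the ASCII table")
        chars.append(chr(byte))  # chr() does the table lookup for codes 0-127
    return "".join(chars)

text = decode_ascii(bytes([0x41, 0x0A]))
print(repr(text))  # 'A\n'
```

Bytes above 127 are rejected because the ASCII table defines only 128 characters.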

One of the problems with ASCII files is that there is no information in the file that tells a program that the contents should be decoded according to the ASCII rules. To demonstrate this, create a file (on Windows) with Notepad and save the file without an extension. When you double click the icon for the file, you will be prompted to select a program to read the file. The reason is that Windows has no way of knowing that the file is encoded in ASCII. Usually files encoded in ASCII have an extension of ".txt" or ".dat", but this is not always the case.

Some operating systems will guess that a file with no extension should be read by a text editor by inspecting a few bytes of the file. If these bytes correspond to printable ASCII characters, it will assume that the file is encoded in ASCII.
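A guess of this kind could look like the following Python sketch. It is an assumed heuristic written for illustration, not the rule used by any particular operating system; the function name looks_like_ascii and the 512-byte sample size are hypothetical choices.

```python
def looks_like_ascii(data, n=512):
    """Guess whether data is ASCII text by inspecting its first n bytes."""
    allowed_control = {0x09, 0x0A, 0x0D}   # tab, newline, carriage return
    sample = data[:n]
    # Accept printable ASCII (0x20-0x7E) plus a few common control characters.
    return all(0x20 <= b <= 0x7E or b in allowed_control for b in sample)

print(looks_like_ascii(b"Hour  Temp\n00:00 32\n"))        # True
print(looks_like_ascii(bytes([0x89, 0x50, 0x4E, 0x47])))  # False
```

The guess can be wrong in both directions. A binary file may happen to begin with printable bytes, and this sketch treats an empty file as text.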

1.5. Binary

Technically, all files on your computer could be called binary files as they contain a sequence of 0s and 1s.

However, we usually don't call an ASCII file a binary file. What we should really say is "it is a file encoded in ASCII". We don't have to mention that the file is binary - all files on a computer are binary files.

What we mean by a binary file is a sequence of 0s and 1s that must be translated into something meaningful using something besides the ASCII table. Said another way, a binary file is just a list of 0s and 1s that cannot be understood unless we are told how to interpret it.

1.5.1. A Simple Encoding

In this section, we introduce a simple spreadsheet encoding rule in order to demonstrate binary encoding.

Suppose a spreadsheet internally represents each decimal number in a cell as a 64-bit pattern. When the file is opened, the operating system copies all of the bit patterns in the spreadsheet from slow memory (usually a disk) into fast memory (RAM). The file must also contain information so the spreadsheet can determine which cell is associated with each pattern. If the spreadsheet had only one value, say 100,000,000, it would need to store the cell name (e.g., A1) along with the location of the bit pattern associated with this name.

If we encoded the cell name in 7-bit ASCII, the file would need to contain the following information (the first row is the human readable version and the second row is what is written in the file):

    A        1        100000000
 1000001  0110001 101111101011110000100000000

In reality, most programs use a standard size for each element above to simplify decoding. If the encoding rule were that the cell name is encoded in 8-bit ASCII and the cell value as a 64-bit binary integer, the file would contain extra zeros (the first row is the human readable version and the second row is what is written in the file):

  A         1                                  100000000
01000001  00110001 0000000000000000000000000000000000000101111101011110000100000000

When the file is read, the spreadsheet reads the first 16 bits to determine the cell name (e.g., A1) in which to place the number found by converting the next 64 bits to an integer. If the end of the file is not found after the 80th bit, it reads the next 16 bits to determine the cell in which to place the next number found in the next 64 bits.

The simple spreadsheet encoding rule is thus

  • Cell names and contents occur in 80-bit chunks.
  • The first 8 bits of a chunk are the cell column. The bit pattern is translated to a letter using the ASCII table.
  • The second 8 bits of a chunk are the cell row. The bit pattern is translated to a digit using the ASCII table.
  • The last 64 bits correspond to the value of the cell. The bit pattern is converted to a number using the binary to decimal method.

The above is not actually how information is encoded in a spreadsheet file. In addition, the spreadsheet file has other information (for example, what color each cell is). However, this simple spreadsheet encoding is sufficient to explain how information could be encoded.
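The simple spreadsheet encoding rule can be sketched in Python with the standard struct module. This is an illustration under stated assumptions: cell names have exactly two characters, values are non-negative integers, and the 64-bit value is stored most significant byte first (matching the way the bit pattern is written above). The names CHUNK, encode_cell and decode_cells are hypothetical.

```python
import struct

# ">" = most significant byte first, "c" = one byte, "Q" = unsigned 64-bit integer.
CHUNK = struct.Struct(">ccQ")   # 1 + 1 + 8 bytes = 80 bits

def encode_cell(name, value):
    """Encode a two-character cell name such as 'A1' and a non-negative integer."""
    letter, digit = name[0].encode("ascii"), name[1].encode("ascii")
    return CHUNK.pack(letter, digit, value)

def decode_cells(data):
    """Decode a byte string made of consecutive 80-bit chunks."""
    cells = {}
    for letter, digit, value in CHUNK.iter_unpack(data):
        cells[(letter + digit).decode("ascii")] = value
    return cells

chunk = encode_cell("A1", 100000000)
bits = " ".join(format(b, "08b") for b in chunk)
print(bits)                  # ten groups of 8 bits
print(decode_cells(chunk))   # {'A1': 100000000}
```

Because every chunk has the same size, the decoder never has to search for a delimiter; it simply reads 10 bytes at a time until the data run out.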

1.5.2. Image Encoding

Files that contain images each have their own encoding rule and a program must be written to translate the 0s and 1s in an image file into color pixels. As an example of an image format, consider a PNG (Portable Network Graphics) file.

First download a PNG file

curl "https://www.google.com/images/srpr/logo11w.png" > logo11w.png

Next, inspect the binary contents (represented as hex) of the start of the file (using head -1):

od -t x1 logo11w.png | head -1

The result is

0000000    89  50  4e  47  0d  0a  1a  0a  00  00  00  0d  49  48  44  52

How do we interpret this? According to [1]

The first eight bytes of a PNG file always contain the following (decimal) values: 137 80 78 71 13 10 26 10

The output of od is hexadecimal, so we need to convert the values to decimal. For example, 89 in hexadecimal (base 16) is 137 in decimal (base 10), because 8 x 16 + 9 = 137. Note that the decimal values 80 78 71 correspond to P N G if interpreted according to the ASCII table.
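The hexadecimal to decimal conversion and the signature check can be sketched in Python. To stay self-contained, the sketch types in the 16 bytes from the od output above rather than reading the downloaded file; the variable names are hypothetical.

```python
# The first 16 bytes shown in the od output above, written as hex.
header = bytes.fromhex("89 50 4e 47 0d 0a 1a 0a 00 00 00 0d 49 48 44 52")

# The eight decimal values quoted from [1].
PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])

first_eight = list(header[:8])        # the same bytes as decimal integers
print(first_eight)                    # [137, 80, 78, 71, 13, 10, 26, 10]
print(header[:8] == PNG_SIGNATURE)    # True
print(header[1:4].decode("ascii"))    # PNG
```

A program that opens image files could run the same comparison on the first eight bytes of a file to decide whether to decode it as PNG.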

1.5.3. Scientific Encoding

The reason that we don't just use ASCII files to store numbers is that ASCII is inefficient in comparison to binary in two ways:

  1. Storing the same information requires more memory
  2. Reading and writing the same information requires more time.

As an example of the first case, suppose that I wanted to save the number 1230. To save this in ASCII would require 4 bytes. Using the ASCII table,

  • The bits 00110000 correspond to 0
  • The bits 00110001 correspond to 1
  • The bits 00110010 correspond to 2
  • The bits 00110011 correspond to 3

To store the sequence of digits 1230, I would write the binary sequence:

00110001 00110010 00110011 00110000

I could define a binary encoding format where

  • The bits 00 correspond to 0,
  • The bits 01 correspond to 1,
  • The bits 10 correspond to 2, and
  • The bits 11 correspond to 3.

To store the same sequence of digits, 1230, I would write the binary sequence:

01 10 11 00

Because this sequence of bits is shorter (8 bits instead of 32), the amount of storage used and the amount of time required to write the file are less.

Of course, the disadvantage of the binary encoded file is that if I gave you the file in my special binary format, you would not know how to interpret it until I told you the binary encoding rule.
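The two-bit encoding can be sketched in Python to make the size difference concrete. This is a minimal sketch, assuming the input contains only the digits 0 to 3 and has a length that is a multiple of four (so that it fills whole bytes); pack_2bit and unpack_2bit are hypothetical names.

```python
def pack_2bit(digits):
    """Pack a string of digits 0-3 into bytes, four digits per byte."""
    if len(digits) % 4 != 0:
        raise ValueError("this sketch needs a multiple of four digits")
    out = bytearray()
    for i in range(0, len(digits), 4):
        byte = 0
        for d in digits[i:i + 4]:
            value = int(d)
            if not 0 <= value <= 3:
                raise ValueError("only the digits 0-3 can be encoded")
            byte = (byte << 2) | value   # append two bits
        out.append(byte)
    return bytes(out)

def unpack_2bit(data):
    """Reverse pack_2bit: recover the digit string from the bytes."""
    return "".join(str((b >> shift) & 0b11) for b in data for shift in (6, 4, 2, 0))

ascii_bytes = "1230".encode("ascii")
packed = pack_2bit("1230")
print(len(ascii_bytes), len(packed))   # 4 1
print(format(packed[0], "08b"))        # 01101100
print(unpack_2bit(packed))             # 1230
```

The packed form is a quarter of the size, but unpack_2bit is the only way to get the digits back, which is exactly the disadvantage described above.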

In the same way that there are many file formats for images, there are many formats for scientific data - each science domain tends to use one or two file formats. Examples include

  • FITS - Astronomy
  • HDF - HPC, Earth Science
  • HDF5 - HPC, Earth Science
  • CDF - Space Science
  • netCDF - Earth Science

In addition, when you save data from MATLAB, by default it is stored in MATLAB's special binary format. When you type

>> A = [1:10];
>> save A.mat A

into MATLAB, the numbers in array A are saved into MATLAB's binary file named A.mat.

2. Problems

2.1. Your own Encoding

Write down a sequence of 0s and 1s along with a sentence or two on how to interpret your sequence. Give your sequence and sentences to a partner and see if they can figure out what your sequence of 0s and 1s means.

Trivial Example:

  • Sequence: 01 11 10 11 00
  • Sentence: "00 means the word 'cake', 01 means the word 'I', 10 means the word 'like', and 11 is a space."

Correct Interpretation:

  • The sequence means "I like cake".

2.2. Simple Spreadsheet Encoding

A file encoded according to the simple spreadsheet encoding rule contains the following sequence of bits (spaces and newlines added for readability):

01000001  00110001 0000000000000000000000000000000000000101111101011110000100000000
01000001  00110010 0000000000000000000000000000000000000000000000000000000000000111

What are the contents of the spreadsheet? That is, what numbers are in what cells?

2.3. Scientific File Format

Find a file encoded in HDF5 (such files usually have an extension of ".h5").

Find a program that allows you to inspect the numbers in this file and save it as an ASCII file.

2.4. Speed Comparison

In MATLAB, create an array of 2^16 random values using randn.

Compare the file sizes and the time required to save this list to a file and then read the file:

  • Using fwrite to save these values as doubles and fread to read the values back in.
  • Using save -ascii -double to save these values and load to read the values back in.

2.5. Speed Comparison

Perform the same speed comparison as was done in the previous problem, except use Python.

2.6. Binary

Problems of the level covered at [2]

A binary file format has the following specification:

  • The first 32 bits (4 bytes) are an unsigned 32-bit integer corresponding to the time of a measurement in milliseconds since January 1, 1970 at 00:00:00.000.
  • The next 64 bits (8 bytes) are a double (written using MATLAB's fwrite function). This value corresponds to a measured temperature.
  • The next 32 bits (4 bytes) are an unsigned 32-bit integer corresponding to the time of a measurement in milliseconds since January 1, 1970 at 00:00:00.000.
  • The next 64 bits (8 bytes) are a double (written using MATLAB's fwrite function). This value corresponds to a measured temperature.
  • etc.

The file is located at [3].

Based on the size of the file reported by ls -l, how many temperature measurements are in the file?
