Reading Text Files

From ComputingForScientists

Contents

  1. High-Level ASCII Read
    1. Example: Read result of the save Command
    2. Example: Read result of save -ascii -double
    3. Example: Sunspot Number Files
  2. Low-Level ASCII Read
    1. fgetl
      1. fgetl Examples
        1. Example: Using str2num
        2. Example: Using str2num
        3. Example: Using regexprep
    2. fscanf
      1. fscanf Examples
        1. Example: Basic Usage
        2. Example: Combining with fgetl
  3. Efficiency Considerations
  4. Problems
    1. High-Level ASCII Read I
    2. High-Level ASCII Read II
    3. High-Level ASCII Read III
    4. High-Level ASCII Read IV
    5. Low-Level ASCII Read I
    6. Low-Level ASCII Read II
    7. Low-Level ASCII Read III
    8. Low-Level ASCII Read IV
    9. Low-Level ASCII Read V
    10. Low-Level ASCII Read VI
    11. Low-Level ASCII Read VII
    12. Low-Level ASCII Read VIII
    13. Low-Level ASCII Read IX
    14. Low-Level ASCII Read X
    15. Low-Level ASCII Read XI
    16. Low-Level ASCII Read XII
    17. regexprep

1. High-Level ASCII Read

The function load can be used to read ASCII files, provided that the files contain only numbers and the number of columns is fixed.

1.1. Example: Read result of the save Command

Executing the commands

% Create a 5x5 array
M = rand(5);
% Create an ASCII file containing the contents of M
save -ascii file.txt M
% Read the file using the load command
load file.txt
% Should see a variable named file that has the same size as M. 
whos 

results in the contents of file.txt being imported into a variable named file. To compare M and file, enter

M - file

on the command line.

1.2. Example: Read result of save -ascii -double

Executing the commands

M = rand(5);
save -ascii -double file.txt M 
load file.txt
whos

results in the contents of file.txt being imported into a variable named file. To compare M and file, enter

M - file

on the command line.
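
The size of the differences shown by M - file depends on how many digits save wrote to the file. The following is a minimal sketch that compares the two examples above, assuming the documented behavior that -ascii writes 8 digits and -ascii -double writes 16 digits. It uses the function forms of save and load, and the file name precision_demo.txt is made up for illustration.

```matlab
M = rand(5);
fname = 'precision_demo.txt';          % hypothetical file name

save(fname,'M','-ascii');              % 8 digits per value
F8 = load(fname);                      % function form returns the matrix

save(fname,'M','-ascii','-double');    % 16 digits per value
F16 = load(fname);

% Largest absolute difference between the original and what was read back
err8  = max(abs(M(:) - F8(:)))         % small, but typically not zero
err16 = max(abs(M(:) - F16(:)))        % zero or at round-off level
delete(fname);
```

If the values read must match M as closely as possible, the -double form is the safer choice.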

1.3. Example: Sunspot Number Files

Reading an ASCII-encoded file using load is not always as easy as in the examples above. As an example of the typical effort needed to read a file, consider the daily total sunspot number data at http://www.sidc.be/silso/versionarchive, which at the time of this writing had two options for downloading ASCII files, which were copied here: dayssn.dat and dayssnv0.dat.

Save dayssn.dat to the same directory that is listed when you type pwd (this prints the working directory) on the MATLAB command line. View it in a text editor (e.g., gedit on Linux, Notepad on Windows, and TextEdit on Mac) and then try

load dayssn.dat

You should get an error:

Number of columns on line 71678 of ASCII file dayssn.dat
must be the same as previous lines.

In general, MATLAB's load command is best at reading an ASCII file that contains only numbers (e.g., 1.0, 1e+10, etc.), with a fixed number of columns.

The second option for a file was for dayssnv0.dat. Save it to the same directory that is listed when you type pwd. Attempting

load dayssnv0.dat

gives the same error:

Number of columns on line 71678 of ASCII file /home/weigel/dayssnv0.dat
must be the same as previous lines.

At this point, it makes sense to have a look at the line described in the error. Inspection of line 71678 shows that the problem is that the number of columns changed from three to four at that line. MATLAB's load command attempts to map the file contents to a matrix and the number of columns in the matrix changed, so it gave up.
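
For a long file, it can help to locate every line whose column count differs from the first line before opening an editor. The following is a minimal sketch of that idea using fgetl (described in the Low-Level ASCII Read section). The file name columns_demo.txt and the values written to it are made up for illustration; they are not taken from dayssnv0.dat.

```matlab
% Create a small made-up file in which the last line has an extra column
fname = 'columns_demo.txt';            % hypothetical file name
fid = fopen(fname,'w');
fprintf(fid,'1818 1818.001 65\n');
fprintf(fid,'1818 1818.004 37\n');
fprintf(fid,'2014 2014.001 80 *\n');
fclose(fid);

% Count the whitespace-separated columns on each line
fid = fopen(fname,'r');
k = 0;
ncols = [];
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  k = k+1;
  ncols(k) = numel(regexp(strtrim(tline),'\s+','split'));
end
fclose(fid);
delete(fname);

% Line numbers whose column count differs from that of line 1
bad = find(ncols ~= ncols(1))
```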

The easy way to resolve this is to open the file with a text editor. Doing this, you will see that at this line there was a new column containing *. Remove any lines that contain a * and then save the file as dayssnv0_modified.dat and try to load the file again. Save the removed lines in dayssnv0_cut.dat.

load dayssnv0_modified.dat

You should not get an error. MATLAB has loaded the data into an array named dayssnv0_modified, which can be verified by typing whos on the command line:

 dayssnv0_modified      71678x3             1720272  double              

The above is a typical approach that new users take when reading an ASCII-encoded file into MATLAB. Sometimes the file will load without modification. Most of the time, you will need to use a procedure similar to the above, a GUI [1], or write a program that uses a low-level function such as fgetl or fscanf.

Notes:

The data file could have been loaded using the GUI found by selecting File > Import. Doing this with MATLAB 2009a led to an import without an error message, but the user was not warned that the lines with a * were not loaded. This is not a good practice with respect to GUI design.

In practice one would use load for the two files, dayssnv0_modified and dayssnv0_cut, and then combine the results.

To use a different name for the variable, use

M = load('dayssnv0_modified.dat'); % Import data into variable named M.

To load a file that was saved in a location other than the directory given by MATLAB's pwd command, include the path in the load command argument, e.g.,

M = load('/home/rweigel/Downloads/dayssnv0.dat');

An alternative method for doing the above is described in the Efficiency Considerations section.

2. Low-Level ASCII Read

One of the most straightforward methods to read an ASCII file is using fgetl [2]. The drawback is that this method is usually slower than using functions such as fscanf. The reason is that fgetl reads a line from a file as a string and then parts or all of the string must be converted to numeric values and placed in a matrix. In contrast, when using fscanf, the structure of the line is pre-specified and so conversion is performed at a lower level using libraries that execute faster. More importantly, fscanf can be used to read an entire file without a loop (and loops can be very slow in scripting languages such as MATLAB).

2.1. fgetl

The help page reads:

FGETL Read line from file, discard newline character.
   TLINE = FGETL(FID) returns the next line of a file associated with file
   identifier FID as a MATLAB string. The line terminator is NOT
   included. Use FGETS to get the next line with the line terminator
   INCLUDED. If just an end-of-file is encountered, -1 is returned.  

2.1.1. fgetl Examples

2.1.1.1. Example: Using str2num

A file data.txt contains

1 19.1 13e4
2 20.2 15e4

The following program reads each line of the file into a string variable named tline:

clear;
fid = fopen('data.txt','r');
while 1
  tline = fgetl(fid);
  if ~ischar(tline)
      break;
  end
  tline
end
fclose(fid);

The lines

if ~ischar(tline)
   break;
end

terminate execution of the while loop when fgetl returns something other than a string. This happens at the end of the file, where fgetl returns the number -1 and ischar is false. We want to convert each string into numeric values that can be added, plotted, etc. To do this, use the function str2num, which converts a string to an array provided that the string looks like it contains only numbers.

clear;
fid = fopen('data.txt','r');
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  str2num(tline)
end                                                                             
fclose(fid)

The final step is to build a matrix, with one line of the file per row of the matrix:

clear;
fid = fopen('data.txt','r');
k = 1;
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  A(k,:) = str2num(tline);
  k = k+1;
end
fclose(fid);
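
A possible variation, under the assumption that every line contains only whitespace-separated numbers, is to convert each line with sscanf instead of str2num. The sketch below writes its own copy of the two-line file under the made-up name data_demo.txt so that it can be run on its own.

```matlab
% Create a file with the same contents as data.txt
fname = 'data_demo.txt';      % hypothetical file name
fid = fopen(fname,'w');
fprintf(fid,'1 19.1 13e4\n2 20.2 15e4\n');
fclose(fid);

fid = fopen(fname,'r');
A = [];
k = 1;
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  % sscanf returns a column vector; transpose it to fill one row of A
  A(k,:) = sscanf(tline,'%f')';
  k = k+1;
end
fclose(fid);
delete(fname);
A
```

Unlike str2num, sscanf with %f only parses numbers and does not evaluate the line as a MATLAB expression.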

2.1.1.2. Example: Using str2num

In the previous example, all of the lines in the file contained numbers that we wanted to put into a matrix, and these lines could be converted to a numeric array using str2num. Consider a file named data2.txt containing:

# A header
# A header
 1 19.1 13e4
 2 20.2 15e4

In this case, we need to skip the first two lines. We can do that by reading them and then doing nothing with the result:

clear;
fid = fopen('data2.txt','r');
% Read first two lines
for i = 1:2
  tline = fgetl(fid);
end

% Read lines with numbers
k = 1;
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  A(k,:) = str2num(tline);
  k = k+1;
end
fclose(fid);
A
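
When the number of header lines is not known in advance, one option is to skip any line that starts with the comment character. The following is a minimal sketch, assuming that header lines always begin with # and that all other non-empty lines contain only numbers. The file name data2_demo.txt is made up for illustration.

```matlab
% Create a file with the same contents as data2.txt
fname = 'data2_demo.txt';     % hypothetical file name
fid = fopen(fname,'w');
fprintf(fid,'# A header\n# A header\n 1 19.1 13e4\n 2 20.2 15e4\n');
fclose(fid);

fid = fopen(fname,'r');
A = [];
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  tline = strtrim(tline);               % remove leading and trailing spaces
  if isempty(tline) || tline(1) == '#'
    continue                            % skip blank lines and header lines
  end
  A(end+1,:) = str2num(tline);          % append a row to A
end
fclose(fid);
delete(fname);
A
```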

2.1.1.3. Example: Using regexprep

In the previous example, each line with numbers could be converted using str2num. Consider the file data3.txt:

# A header
# A header
2014-01-01 00:00 19.1 13e4
2014-01-01 00:01 20.2 15e4

The line 2014-01-01 00:00 19.1 13e4 cannot be converted properly using str2num. To see this, enter

format long
str2num('2014-01-01 00:00 19.1 13e4')

which displays ans =

  1.0e+05 *
  0.020120000000000  0  0.000191000000000 1.300000000000000

The first value is 2012, so MATLAB has assumed 2014-01-01 meant subtraction. We need to replace these non-numeric characters with a space. This can be done using regexprep (regular expression replace):

% Replace - with space
tmp = regexprep('2014-01-01 00:00 19.1 13e4','-',' ')
% Replace : with space
tmp = regexprep(tmp,':',' ')

displays

tmp =

2014 01 01 00:00 19.1 13e4
    
tmp =

2014 01 01 00 00 19.1 13e4

so that

str2num(tmp)

gives the desired result:

ans =
  1.0e+05 *
   0.0201    0.0000    0.0000         0         0    0.0002    1.3000

Putting everything together gives

clear A
fid = fopen('data3.txt','r');
% Read first two lines and do nothing with them
for i = 1:2
  tline = fgetl(fid);
end

% Replace characters that cause str2num to fail
k = 1;
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  tline = regexprep(tline,'-',' '); % Replace hyphen with space
  tline = regexprep(tline,':',' '); % Replace colon with space
  % Or ('-|:' means '-' or ':' as | is the symbol for logical or in regular expressions)
  % tline = regexprep(tline,'-|:',' '); % Replace hyphen or colon with space

  A(k,:) = str2num(tline);
  k = k+1;
end
fclose(fid);
A
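
Once the date and time parts are in separate columns, they can be combined into a single time value. The following sketch is one way to do this with datenum; it types in the matrix that the program above should produce for data3.txt rather than reading the file, and it assumes a seconds value of zero because the file does not contain seconds.

```matlab
% Rows as produced by the program above for data3.txt
A = [2014 1 1 0 0 19.1 13e4
     2014 1 1 0 1 20.2 15e4];

% datenum accepts rows of [year month day hour minute second]
t = datenum([A(:,1:5) zeros(size(A,1),1)]);
datestr(t,31)        % display the times as yyyy-mm-dd HH:MM:SS

data = A(:,6:7);     % the two measurement columns
```

The vector t can then be used as the horizontal axis in a plot of either column of data.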

2.2. fscanf

In lower-level languages (e.g., Fortran and C), one would iterate through each line in a file and use a formatted read function such as fscanf for each line. In MATLAB, fscanf does not require a loop to read each line; the function is said to be "vectorized". The first part of the help page for fscanf is

 fscanf Read data from text file.
    [A,COUNT] = fscanf(FID,FORMAT) reads and converts data from a text file
    into array A in column order. FID is a file identifier obtained from
    FOPEN. COUNT is an optional output argument that returns the number of
    elements successfully read.
 
    FORMAT is a string containing ordinary characters and/or conversion
    specifications, which include a % character, an optional asterisk for 
    assignment suppression, an optional width field, and a conversion
    character (such as d, i, o, u, x, e, f, g, s, or c).
 
    fscanf reapplies the FORMAT throughout the entire file. If fscanf
    cannot match the FORMAT to the data, it reads only the portion that
    matches into A and then stops processing. For more details on the 
    FORMAT input, type "doc fscanf" at the command prompt.

2.2.1. fscanf Examples

In general, fscanf will be faster than using a loop with fgetl.

In the examples below, note that the matrix returned by fscanf must be reshaped and/or transposed to recover the original.

2.2.1.1. Example: Basic Usage

clear;
% Create a matrix
A = [1:100;100:-1:1];
% Inspect a few elements
A(1:2,1:2)

% Create file
fid = fopen('file.txt','w');
fprintf(fid,'%d %f\n',A);
fclose(fid);
type file.txt

% Read without specifying size
fid = fopen('file.txt');
B = fscanf(fid,'%d %f\n');
whos B
B = reshape(B,2,length(B)/2)';
whos B
B(1:2,1:2)
fclose(fid);

% Read with specifying size
fid = fopen('file.txt');
B = fscanf(fid,'%d %f\n',size(A));
whos B
B = B';
whos B
B(1:2,1:2)
fclose(fid);
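
When the number of rows in the file is not known in advance, the size argument can be given as [2 Inf], which tells fscanf to fill columns of length 2 until the data run out. The following is a minimal sketch using a three-line file; the file name fscanf_demo.txt is made up for illustration.

```matlab
% Write three lines: "1 10.000000", "2 20.000000", "3 30.000000"
fname = 'fscanf_demo.txt';    % hypothetical file name
fid = fopen(fname,'w');
fprintf(fid,'%d %f\n',[1 2 3; 10 20 30]);
fclose(fid);

fid = fopen(fname,'r');
B = fscanf(fid,'%d %f',[2 Inf]);  % each line of the file becomes a column of B
fclose(fid);
delete(fname);

B = B'                            % transpose so each line of the file is a row
```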

2.2.1.2. Example: Combining with fgetl

In this example, we read the first two lines using fgetl prior to reading the numbers.

% Create a file to be read
fid = fopen('file2.txt','w');
fprintf(fid,'Header Line\n');
fprintf(fid,'Header Line\n');
fprintf(fid,'%d %f\n',A);
fclose(fid);
type file2.txt

% Read the file
fid = fopen('file2.txt');
l1 = fgetl(fid); % Line 1
l2 = fgetl(fid); % Line 2
A = fscanf(fid,'%d %f\n',size(A));
A = A';
A(1:2,1:2)
fclose(fid);
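
The help text states that fscanf stops when it cannot match the format. The second output, COUNT, can be used to check how much was read. The following is a minimal sketch for a file that has text both before and after the numbers; the file name footer_demo.txt and its contents are made up for illustration.

```matlab
% Create a file with a header line, two data lines, and a footer line
fname = 'footer_demo.txt';    % hypothetical file name
fid = fopen(fname,'w');
fprintf(fid,'Header Line\n');
fprintf(fid,'%d %f\n',[1 2; 10 20]);
fprintf(fid,'End of File\n');
fclose(fid);

fid = fopen(fname,'r');
h = fgetl(fid);                              % read and discard the header
[B,count] = fscanf(fid,'%d %f\n',[2 Inf]);   % stops where the format no longer matches
rest = fgetl(fid);                           % whatever remains on the line where fscanf stopped
fclose(fid);
delete(fname);

B = B'
count                                        % number of values successfully read
```

Comparing count with the number of values expected is a simple way to detect that fscanf stopped early.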

3. Efficiency Considerations

When only a single file must be read, one typically uses the first approach that produces the desired result.

When thousands of files must be read, speed becomes a consideration. Two techniques are covered here.

  1. Using fscanf with the format '%c' to read the entire file as a string and then regexprep and str2num to convert the string to numbers.
  2. Using operating system programs to pre-process the files prior to reading them in MATLAB.

When large files must be read, memory efficiency becomes a consideration. For example, if we wanted to compute the average of a column in a large file that would not fit in memory, instead of using load, we would read the file in chunks, compute the sum of each chunk, and then delete the chunk but keep the running sum and the number of values read.
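
The chunked approach can be sketched as follows, assuming a file with three numeric columns and no header. The file name chunk_demo.txt, the chunk size of 100 rows, and the random test data are all made up for illustration; a real application would use a much larger file and chunk size.

```matlab
% Create a test file with 1000 rows and 3 columns
fname = 'chunk_demo.txt';     % hypothetical file name
X = [(1:1000)' rand(1000,2)];
fid = fopen(fname,'w');
fprintf(fid,'%d %.8f %.8f\n',X');
fclose(fid);

nrows = 100;   % number of rows to read per chunk
total = 0;     % running sum of column 2
n = 0;         % number of rows read so far
fid = fopen(fname,'r');
while 1
  % Read at most nrows rows; each row of the file becomes a column of chunk
  chunk = fscanf(fid,'%f %f %f',[3 nrows]);
  if isempty(chunk),break,end
  total = total + sum(chunk(2,:));
  n = n + size(chunk,2);
end
fclose(fid);
delete(fname);

avg = total/n
```

Only one chunk is held in memory at a time because the variable chunk is overwritten on each iteration.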

Example

MATLAB can easily read ASCII files with values that are separated by spaces and tabs and records that are separated by newlines. For example, a file named a.txt with contents

0 1 2
3 4 5

can be read using (the data will be placed in a matrix named a):

load a.txt
a

Often data files have additional characters, for example, : and /. Such files cannot be read directly using load.

8/11/2012	14:00:00.014.262	158.866	80.3515	223.167
8/11/2012	14:00:00.029.887	158.885	80.285	223.215
8/11/2012	14:00:00.045.512	158.948	80.1195	222.661
8/11/2012	14:00:00.061.137	159.011	80.3179	222.358
8/11/2012	14:00:00.076.762	158.837	80.3302	222.826
8/11/2012	14:00:00.092.387	159.014	80.3052	223.354
8/11/2012	14:00:00.108.012	158.999	80.226	223.069
8/11/2012	14:00:00.123.637	159.055	80.1786	222.017
8/11/2012	14:00:00.139.262	158.909	80.3855	222.247
8/11/2012	14:00:00.154.887	158.833	80.3708	223.342

In the following, the entire file is read with one call to fscanf. The file could be read line-by-line, but this will be much slower.

fid = fopen('a.txt');
s = fscanf(fid,'%c'); % Read entire file at once.
fclose(fid);
s = regexprep(s,'/',' '); % Replace all slashes with a space
s = regexprep(s,':',' '); % Replace all colons with a space
% Replace patterns such as 11.98.987 with 11 98 987
% Use a program such as http://www.regexe.com/ to figure out regular expression and replacement pattern
s = regexprep(s,' ([0-9][0-9])\.([0-9][0-9][0-9])\.([0-9][0-9][0-9])',' $1 $2 $3');
% The string now contains only numbers, spaces, tabs, and newlines. Convert to array using str2num.
d = str2num(s);
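
The difference in speed between the two approaches can be measured with tic and toc. The following is a minimal sketch; the file name timing_demo.txt and the size of the test data are made up for illustration, and the measured times will depend on the computer and the MATLAB version.

```matlab
% Create a test file with 2000 rows and 3 columns
fname = 'timing_demo.txt';    % hypothetical file name
X = rand(2000,3);
fid = fopen(fname,'w');
fprintf(fid,'%.6f %.6f %.6f\n',X');
fclose(fid);

% Method 1: read line by line with fgetl
tic;
fid = fopen(fname,'r');
A1 = zeros(2000,3);
k = 0;
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end
  k = k+1;
  A1(k,:) = str2num(tline);
end
fclose(fid);
t1 = toc;

% Method 2: read the entire file as one string, then convert once
tic;
fid = fopen(fname,'r');
s = fscanf(fid,'%c');
fclose(fid);
A2 = str2num(s);
t2 = toc;
delete(fname);

fprintf('fgetl loop: %.3f s, single read: %.3f s\n',t1,t2);
isequal(A1,A2)   % both methods should give the same matrix
```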

Example - Sunspot Number File Revisited

Earlier, sunspot number data was read using load after manually modifying the file. In this example, the contents of the entire file are read into a string s and then all of the * values are removed.

fid = fopen('dayssnv0.dat','r')
s = fscanf(fid,'%c');
fclose(fid);
% Remove all instances of *.
s = regexprep(s,'\*','');
% Convert the string to an array.
dayssn = str2num(s);
whos dayssn

The result should be

 dayssn      71831x3             1723944  double              

The above approach is not ideal. Although we have the data loaded into MATLAB, we have lost information. In practice, one would use the above approach to read the data, and then if it was decided to use the data for a report or publication, one would use either a more advanced regular expression to replace * with a flag value (e.g., -99) and add NaN values when there were only three columns, or a low-level ASCII read approach so that all of the information in the file is represented in what was read into MATLAB.

To append a NaN to lines with three columns and to replace * values with 0:

fid = fopen('dayssnv0.dat','r');
s = fscanf(fid,'%c');
fclose(fid);
tmp = regexprep(s,'([0-9])  \n','$1 NaN\n');
tmp = regexprep(tmp,'\*','0');
A = str2num(tmp);

figure(1);clf;
    plot(A(:,2),A(:,3),'b.');hold on;
    plot(A(:,2),A(:,4),'r.');
    legend('SSN (999 = missing)','Provisional')
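
The effect of the two regexprep calls can be checked on a short string before applying them to the full file. The following is a minimal sketch; the two lines in s are made-up values that only imitate the layout assumed by the patterns above (two trailing spaces on three-column lines and a * in the fourth column otherwise).

```matlab
% Two made-up lines: one with three columns, one with a * in column four
s = sprintf('1818 1818.001 65  \n2014 2014.001 80 *\n');

% Append NaN to lines that end with a digit followed by two spaces
tmp = regexprep(s,'([0-9])  \n','$1 NaN\n');
% Replace each * with a number so that str2num can convert the line
tmp = regexprep(tmp,'\*','0');

A = str2num(tmp)
```

Both rows of A then have four columns, with NaN marking rows that had no fourth column in the file.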

4. Problems

4.1. High-Level ASCII Read I

Write a program that saves values from the matrix M = repmat([1:5],4,1) into a file named file.txt and then reads the values from the file and compares the values read with the values in M.

4.2. High-Level ASCII Read II

Write a program that saves values from the matrix M = repmat([1:5],4,1) into a file named file.txt and then reads the values from the file and compares the values read with the values in M. Use a version of the save command such that the values read exactly match the values in M.

4.3. High-Level ASCII Read III

Download and then use a text editor to modify the file testfile.txt so that the commands

load testfile_modified.txt
whos testfile_modified

displays

testfile_modified      4x3        96     double

4.4. High-Level ASCII Read IV

Execute the following program to create a file named file.txt:

fid = fopen('file.txt','w');
c = '-';
for i = 1:1000
    if i > 500
        c = '*';
    end
    fprintf(fid,'%d %f %s\n',i,rand(),c);
end
fclose(fid);

First, attempt to read this file using

load file.txt

Then, open this file in a text editor, modify it, and save the result as file_modified.txt such that

load file_modified.txt
whos file_modified

displays

 file_modified       1000x3                    24000  double  

Finally, attempt to import the data (without information loss) in file.txt using the File Import GUI (I don't think it can be done).

4.5. Low-Level ASCII Read I

Consider the file data_I.txt created by executing the program:

fid = fopen('data_I.txt','w');                                              
fprintf(fid,'2013 01 01 00 00 01 2.1 3.1\n');                                   
fprintf(fid,'2013 01 01 00 00 02 4.1 9.9\n');                                   
fprintf(fid,'2013 01 01 00 00 03 4.1 12.9\n');                                  
fclose(fid);                                                                    

Execute this program on the command line, verify that the file data_I.txt was created, and then write a program that uses fgetl and str2num to read the values into a matrix.

4.6. Low-Level ASCII Read II

Consider the file data_II.txt created by executing the program:

fid = fopen('data_II.txt','w');                                                    
fprintf(fid,'The following columns have labels of\n');                          
fprintf(fid,'Time Amplitude Speed\n');                                          
fprintf(fid,'2013 01 01 00 00 01 2.1 3.1\n');                                   
fprintf(fid,'2013 01 01 00 00 02 4.1 9.9\n');                                   
fprintf(fid,'2013 01 01 00 00 03 4.1 12.9\n');                                  
fprintf(fid,'End of File\n');                                                   
fclose(fid);                                                                    

Execute this program on the command line, verify that the file data_II.txt was created, and then write a program that uses fgetl and str2num to read the values into a matrix.

4.7. Low-Level ASCII Read III

The following program creates a file named data_III.txt. Execute it, verify the file data_III.txt was created, and then write a program that reads the contents of the file into a matrix with two columns. Your program should use fgetl, str2num, and regexprep. In place of each *, use NaN. Plot the first column versus the second column.

fid = fopen('data_III.txt','w');

for i = 1:1440
    v = randi(10,1);
    if v == 10
        fprintf(fid,'%d *\n',i);
    else
        fprintf(fid,'%d %d\n',i,v);
    end
end 
fclose(fid);

4.8. Low-Level ASCII Read IV

Consider the file data_IV.txt created by executing the program:

fid = fopen('data_IV.txt','w');                                                    
fprintf(fid,'The following columns have labels of\n');                          
fprintf(fid,'Time Amplitude Speed\n');                                          
fprintf(fid,'2013-01-01T00:00:01 2.1 3.1\n');                                   
fprintf(fid,'2013-01-01T00:00:02 4.1 9.9\n');                                   
fprintf(fid,'2013-01-01T00:00:03 4.1 12.9\n');                                  
fprintf(fid,'End of File\n');                                                   
fclose(fid);                                                                    

Execute it, verify the file data_IV.txt was created, and then write a program that reads the contents of the file into a matrix named A containing the numeric values of the file so that

A =

  1.0e+03 *

   2.0130    0.0010    0.0010         0         0    0.0010    0.0021    0.0031
   2.0130    0.0010    0.0010         0         0    0.0020    0.0041    0.0099
   2.0130    0.0010    0.0010         0         0    0.0030    0.0041    0.0129


4.9. Low-Level ASCII Read V

The following program creates a file named data_V.txt:

fid = fopen('data_V.txt','w');                                                    
fprintf(fid,'The following columns have labels of\n');                          
fprintf(fid,'Time Amplitude Speed Invalid\n');
fprintf(fid,'If value is invalid, the fourth column is 1, otherwise it is blank.\n');
fprintf(fid,'2013-01-01T00:00:01 2.1 3.1    \n');
fprintf(fid,'2013-01-01T00:00:02 4.1 9.9 1\n');
fprintf(fid,'2013-01-01T00:00:03 4.1 1.9    \n');                                  
fprintf(fid,'End of File\n');                                                   
fclose(fid);

Execute it, verify the file data_V.txt was created, and then write a program that reads the lines of data_V.txt that contain numbers into a numeric matrix. Note that one of the data lines has an extra column. When you create the matrix, place a 0 in the location where there is a missing column. Your final result should be

A =

   1.0e+03 *

    2.0130    0.0010    0.0010         0         0    0.0010    0.0021    0.0031         0
    2.0130    0.0010    0.0010         0         0    0.0020    0.0041    0.0099    0.0010
    2.0130    0.0010    0.0010         0         0    0.0030    0.0041    0.0019         0

4.10. Low-Level ASCII Read VI

The following program creates a file named data_VI.txt. Execute it and then write a program that reads the contents of the file into a matrix with columns of year, month, day, hour, minute, second, temperature, velocity.

fid = fopen('data_VI.txt','w');                                                                                   
                                                                                                               
fprintf(fid,'This file contains columns of time, temperature in [C], and wind speed in mph\n');                
dn = datenum([2016,4,16]) + [0:24*10-1]/24;                                                                    
for i = 1:length(dn)                                                                                           
    T = randi([60,70],1);                                                                                      
    V = randi([0,15],1);                                                                                       
    ds = datestr(dn(i),31);                                                                                    
    fprintf(fid,'%s %d %d\n',ds,T,V);                                                                          
end                                                                                                            
fclose(fid);                                                                                                   

4.11. Low-Level ASCII Read VII

Save the following text in a file named data_VII.txt and then write a program that uses fscanf to read it. Do not modify the contents of the following text in any way.

date,time,ch16_dc
2003/10/29,00:00:00,0.12
2003/10/29,00:00:02,0.18
2003/10/29,00:00:04,0.21
2003/10/29,00:00:06,0.15
2003/10/29,00:00:08,0.21
2003/10/29,00:00:10,0.21
2003/10/29,00:00:12,0.24
2003/10/29,00:00:14,0.18
2003/10/29,00:00:16,0.24

4.12. Low-Level ASCII Read VIII

Copy the following into a file named data_VIII.txt. Do not modify the content in any way. Write a program that reads the data for years 1999 and 2000 into matrices named D1999 and D2000, respectively. Each matrix should have 6 rows and 7 columns.

# Measurements of pH in Lake Michigan
# * indicates no measurement was made
# Year 2000 Measurements
2000-01-01T00:00:00 7.01
2000-02-01T00:00:00 7.01
2000-03-01T00:00:00 *
2000-04-01T00:00:00 6.98
2000-05-01T00:00:00 7.00
2000-06-01T00:00:00 6.97
# Year 1999 Measurements
1999-01-01T00:00:00 7.00
1999-02-01T00:00:00 6.99
1999-03-01T00:00:00 7.01
1999-04-01T00:00:00 6.98
1999-05-01T00:00:00 *
1999-06-01T00:00:00 7.00
# End of file

4.13. Low-Level ASCII Read IX

Write a program that reads the daily total sunspot number values from http://www.sidc.be/silso/datafiles without a loss of information.

4.14. Low-Level ASCII Read X

For each of the following files, write a program that reads the file (created using [3]) and plots the apparent magnitude versus the number of days since November 3, 2012. The number of apparent magnitude points in each of your plots should be 28. Save each of the files by clicking the link and then selecting File->Save.

4.15. Low-Level ASCII Read XI

The web page http://aa.usno.navy.mil/data/docs/geocentric.php provides information similar to that of the web page considered in the previous problem.

  1. Write a program that reads in the data for Mars over the maximum available time range allowed.
  2. Create a plot comparing the apparent magnitude computed by this web page with that found in the previous problem (over a time range when both have data for Mars).

4.16. Low-Level ASCII Read XII

Read in and plot the data for a planet at http://ssd.jpl.nasa.gov/horizons.cgi

Read in and plot the data for the solar radio flux at [4]

Sample solution for http://ssd.jpl.nasa.gov/horizons.cgi. How can this be improved?

clear
fid = fopen('horizons_results.txt','r');
for i=1:61
    line = fgetl(fid);
end

for i=1:2223
    line = fgetl(fid);
    if ~ischar(line)
        break
    end
    tmp = regexprep(line,'/T','1');
    tmp = regexprep(tmp,'/L','0');
    timestring = tmp(1:12)
    time(i) = datenum(timestring,'yyyy-mm-dd');
    tmp = tmp(19:end);
    D(i,:) = str2num(tmp);
end
fclose(fid);
plot(time-time(1),D(:,7));
xlabel('Days since 2010-Oct-31');
ylabel('Apparent Magnitude');

4.17. regexprep

Write a statement using regexprep that converts the string Hello 1 2 3 Goodbye to 1 2 3.
