Web Scraping

Contents

  1. Overview
  2. Example I
  3. Example II
    1. Step I
    2. Step II
    3. Step III
    4. Step IV
    5. Step V
  4. Problems
    1. Problem I
    2. Problem II
    3. Problem III
    4. Problem IV
    5. Problem V
    6. Problem VI

1. Overview

Web Scraping, or Screen Scraping, is the process of extracting unstructured or semi-structured data from a web page.

In an ideal world, web sites holding data would provide an API (Application Programming Interface) that allowed programmers to easily request structured data in any convenient format (e.g., CSV file, XLS file, MATLAB binary file, etc.).

The reality is that we often need to write a significant amount of code to manipulate the contents of a web page so that it can be used by analysis software. In this section, we give two examples. Keep in mind that there are many ways to achieve the same results. The approach that you take will depend on how fast you need the processing done, the programs that you are familiar with, and how general your code needs to be.

2. Example I

Suppose that we wanted to plot data from the BOU magnetometer available at [1]. The contents of this directory change each day (an additional file is added and occasionally older data are removed). To automate the process of plotting the contents of this directory, we could first download a list of all files:

curl http://magweb.cr.usgs.gov/data/magnetometer/BOU/OneMinute/

Upon inspecting the results, it seems that the content that we want could be easily extracted using cut

curl http://magweb.cr.usgs.gov/data/magnetometer/BOU/OneMinute/ > tmp.html
cut -b 16-34 tmp.html

Some of the lines are blank, so we can send the result through grep

cut -b 16-34 tmp.html | grep "bou" 

and to save the list to a file, use

cut -b 16-34 tmp.html | grep "bou" > list.txt

and we now have a list of files in list.txt. To download all of the files, we could have created a list of commands to execute, e.g.,

cut -b 16-34 tmp.html | grep bou | sed 's#bou#wget http://magweb.cr.usgs.gov/data/magnetometer/BOU/OneMinute/bou#'

and then executed the commands using

cut -b 16-34 tmp.html | grep bou | sed 's#bou#wget http://magweb.cr.usgs.gov/data/magnetometer/BOU/OneMinute/bou#' | bash

As an alternative, we'll read list.txt in MATLAB and have MATLAB download and read each file it lists. The following program reads each line of list.txt and displays it.

fid = fopen('list.txt','r');
while 1
  tline = fgetl(fid);
  if ~ischar(tline),break,end;
  tline
end
fclose(fid);

Next, we need to download each of the files displayed, using urlwrite. The following program does this for only the first file (the break command is used to stop processing of the loop):

fid = fopen('list.txt','r');
while 1
  tline = fgetl(fid);
  %if ~ischar(tline),break,end;
  url = ['http://magweb.cr.usgs.gov/data/magnetometer/BOU/OneMinute/',tline];
  urlwrite(url,tline);
  break;
end
fclose(fid);

MATLAB has saved the entire contents of the file. To be able to plot the data, we need to create arrays containing numbers. We can do this by reading the file line-by-line:

fid = fopen('list.txt','r');
while 1
  tline = fgetl(fid);
  %if ~ischar(tline),break,end;
  url = ['http://magweb.cr.usgs.gov/data/magnetometer/BOU/OneMinute/',tline];
  urlwrite(url,tline);

  fid2 = fopen(tline,'r');
  while 1
    tline2 = fgetl(fid2);
    if ~ischar(tline2),break,end;
    tline2
    break;  % Only read one line of file
  end
  fclose(fid2);
  break; % Only read one file
end
fclose(fid);

Next, we need to keep only lines with numbers. This can be done by inspecting the first character of each line. If it is a 2, the line has data. Given a data line, we then replace : and - characters with a space so that str2num can convert the string into an array of numbers:

   if (tline2(1) == '2')
     line = regexprep(tline2,':|-',' ');
     data(k,:) = str2num(line);
     k = k+1;
   end

(Note that this also turns negative data values into positive values, because their minus signs are replaced. A way to avoid this is to use two steps: line = regexprep(tline2,'([0-9])-','$1 '); line = regexprep(line,':',' ');)
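
For example, here is a minimal sketch of the two-step replacement applied to a single data line (the sample line is illustrative; it follows the format of the files above):

tline2 = '2013-04-02 00:00:00.000 092     20819.61    -47.89  47705.42  52585.28';
line = regexprep(tline2,'([0-9])-','$1 '); % Remove dashes in the date, keep minus signs
line = regexprep(line,':',' ');            % Remove colons in the time
str2num(line)  % [2013 4 2 0 0 0 92 20819.61 -47.89 47705.42 52585.28]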

The final code that reads the entire content of the first file is:

fid = fopen('list.txt','r');
while 1
  tline = fgetl(fid);
  %if ~ischar(tline),break,end;
  url = ['http://magweb.cr.usgs.gov/data/magnetometer/BOU/OneMinute/',tline];
  if ~exist(tline,'file') % If file is not on disk, download it.
    urlwrite(url,tline);
  end 

  fid2 = fopen(tline,'r');
  k = 1;
  while 1
    tline2 = fgetl(fid2);
    if ~ischar(tline2),break,end;
    if (tline2(1) == '2')
      line = regexprep(tline2,':|-',' ');
      data(k,:) = str2num(line);
      k = k+1;
    end
  end
  fclose(fid2);
  break;
end
fclose(fid);

whos data
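
To plot the H component versus time, we can convert the date and time columns to MATLAB date numbers. The following is a minimal sketch; it assumes the column layout of the files above (columns 1-6 hold the date and time, column 8 holds BOUH) and that values of 99999 or larger mark missing data:

% Plot BOUH versus time (assumes columns 1-6 are Y,M,D,H,M,S and column 8 is BOUH).
t = datenum(data(:,1),data(:,2),data(:,3),data(:,4),data(:,5),data(:,6));
H = data(:,8);
H(H >= 99999) = NaN;   % Values of 99999 or larger indicate missing data; NaN is not plotted
plot(t,H);
datetick('x');
xlabel('Universal Time');
ylabel('BOUH [nT]');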

3. Example II

After inspecting http://sohodata.nascom.nasa.gov/cgi-bin/data_query, we find some images that we would like to study. When we select HMI Continuum over the past week and look at the links associated with the returned images, we find that the URLs have a pattern. To download one image taken at 1930 Universal Time on each of four days, we can use

curl -O http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150314/20150314_1930_hmiigr_512.jpg
curl -O http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150315/20150315_1930_hmiigr_512.jpg
curl -O http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150316/20150316_1930_hmiigr_512.jpg
curl -O http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150317/20150317_1930_hmiigr_512.jpg

which is equivalent to

curl http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150314/20150314_1930_hmiigr_512.jpg > 20150314_1930_hmiigr_512.jpg
curl http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150315/20150315_1930_hmiigr_512.jpg > 20150315_1930_hmiigr_512.jpg
curl http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150316/20150316_1930_hmiigr_512.jpg > 20150316_1930_hmiigr_512.jpg
curl http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2015/hmiigr/20150317/20150317_1930_hmiigr_512.jpg > 20150317_1930_hmiigr_512.jpg

Suppose that we want to create a 2x2 grid of these four solar images.

To create a 2x2 grid, we can use the command montage:

montage -tile 2x2 20150314_1930_hmiigr_512.jpg 20150315_1930_hmiigr_512.jpg 20150316_1930_hmiigr_512.jpg 20150317_1930_hmiigr_512.jpg montage.jpg

The result is shown here:

[Image: montage.jpg]

Suppose that you want to create a montage with 13 rows and 27 columns for all of the images from January 1, 2014 through December 17, 2014, and you don't want to manually type the 351 required curl commands followed by a montage command with 351 file names.

To do this, you need to write a program that generates both the download commands and the montage command.

This process is given in the steps below. In Steps I-IV, commands are created for a montage with a single row of 10 images.

To install montage, execute

sudo apt-get update
sudo apt-get install imagemagick

on the command line. The command line program montage is a part of the imagemagick package. For more information about montage, enter man montage or visit [2].

3.1. Step I

Write a program in MATLAB/Octave/Python that displays to the screen curl commands for downloading images for the first 10 days in January 2014. The result should be

 curl -O http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2014/hmiigr/20140101/20140101_1930_hmiigr_512.jpg
 ...
 curl -O http://sohowww.nascom.nasa.gov//data/REPROCESSING/Completed//2014/hmiigr/20140110/20140110_1930_hmiigr_512.jpg

displayed to the screen, where the ... represents eight additional curl commands.
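
One possible approach in MATLAB/Octave is a sketch like the following, which loops over the day of month and uses the URL pattern from the curl commands above:

% Display curl commands for the first 10 days of January 2014.
for d = 1:10
  ds  = sprintf('201401%02d',d);
  url = sprintf(['http://sohowww.nascom.nasa.gov//data/REPROCESSING/', ...
                 'Completed//2014/hmiigr/%s/%s_1930_hmiigr_512.jpg'],ds,ds);
  fprintf('curl -O %s\n',url);
end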

3.2. Step II

Modify the program created in Step I so that it writes the commands to a file named solar.sh.

Download the files by executing bash solar.sh on the command line.
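
A minimal modification of the Step I sketch opens solar.sh with fopen and writes each command with fprintf:

% Write the curl commands to solar.sh instead of to the screen.
fid = fopen('solar.sh','w');
for d = 1:10
  ds  = sprintf('201401%02d',d);
  url = sprintf(['http://sohowww.nascom.nasa.gov//data/REPROCESSING/', ...
                 'Completed//2014/hmiigr/%s/%s_1930_hmiigr_512.jpg'],ds,ds);
  fprintf(fid,'curl -O %s\n',url);
end
fclose(fid);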

3.3. Step III

In Step I, you wrote a program that displayed a list of download commands to the screen. The next task is to create a montage command. Write a program in MATLAB/Octave/Python that displays the following string to the screen:

montage -tile 10x1 20140101_1930_hmiigr_512.jpg ... 20140110_1930_hmiigr_512.jpg solar2.jpg

where the ... represents the names of eight other files. This command must be executed from the directory in which the image files were downloaded.
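
A sketch of one way to build this string, appending one file name per loop iteration:

% Build the montage command as a single string and display it.
cmd = 'montage -tile 10x1';
for d = 1:10
  cmd = [cmd,sprintf(' 201401%02d_1930_hmiigr_512.jpg',d)];
end
cmd = [cmd,' solar2.jpg'];
disp(cmd);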

3.4. Step IV

Modify the code in Step III so that it writes the string to a file named solar2.sh.

Create the montage by executing rm -f solar2.jpg ; bash solar2.sh on the command line. A file named solar2.jpg should have been created. You may view this file using display solar2.jpg on the command line or by selecting the image from a file browser.

(The command rm -f solar2.jpg is added because if you execute bash solar2.sh twice, the second time a file named solar2-0.jpg is created. Even if you request the output file to be named solar2.jpg, montage will not overwrite an existing file.)

Note that when you view solar2.jpg there will be only 9 images. A hint is given in a message that is displayed when bash solar2.sh is executed:

montage: Not a JPEG file: starts with 0x3c 0x21 `20140108_1930_hmiigr_512.jpg' @ error/jpeg.c/JPEGErrorHandler/322

If you execute ls -l in the directory of images, you will see that 20140108_1930_hmiigr_512.jpg is much smaller than the other files. The program montage claims that this file is not a JPEG file. To verify this, execute cat 20140108_1930_hmiigr_512.jpg. What you will see is HTML text indicating that the file was not found.

To address this, in solar2.sh, replace 20140108_1930_hmiigr_512.jpg with null: and execute rm -f solar2.jpg ; bash solar2.sh.

The null: instructs montage to place an empty tile. You should now have 9 images placed side-by-side with a gap in the location of the missing image.
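
Rather than editing solar2.sh by hand, the program from Step IV can make this substitution automatically. Below is a minimal sketch of the check, assuming the image files have already been downloaded; a valid JPEG starts with the bytes 0xff 0xd8, while the "not found" page starts with '<' (byte 0x3c), as the montage error message indicates:

% Use null: in place of a file that is not a JPEG (e.g., an HTML "not found" page).
fname = '20140108_1930_hmiigr_512.jpg';
fid = fopen(fname,'r');
b = fread(fid,1,'uint8');             % Read the first byte of the file
fclose(fid);
if isempty(b) || b == hex2dec('3c')   % '<' - an HTML page, not a JPEG
  fname = 'null:';
end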

3.5. Step V

Create a montage with 13 rows and 27 columns. If a file is not found for a given day, use a blank tile. Your final result should be a single image with 13 rows and 27 columns of tiles. One possible approach is sketched below.
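
The following sketch generates the full montage command, downloading each image with urlwrite and substituting null: when a download fails. The file names solar3.sh and solar3.jpg are arbitrary choices, and the behavior of urlwrite for a missing file may differ between MATLAB and Octave; the first-byte check from Step IV can be used instead.

% Build a 27x13 (columns x rows) montage command for 2014-01-01 through 2014-12-17.
% The names solar3.sh and solar3.jpg are arbitrary choices.
dn  = datenum(2014,1,1):datenum(2014,12,17);   % 351 days
cmd = 'montage -tile 27x13';
for i = 1:length(dn)
  ds    = datestr(dn(i),'yyyymmdd');
  fname = sprintf('%s_1930_hmiigr_512.jpg',ds);
  url   = sprintf(['http://sohowww.nascom.nasa.gov//data/REPROCESSING/', ...
                   'Completed//2014/hmiigr/%s/%s'],ds,fname);
  if ~exist(fname,'file')
    [f,status] = urlwrite(url,fname);  % status is 0 if the download failed
  else
    status = 1;                        % File already on disk
  end
  if status
    cmd = [cmd,' ',fname];
  else
    cmd = [cmd,' null:'];              % Blank tile for a missing day
  end
end
fid = fopen('solar3.sh','w');
fprintf(fid,'%s solar3.jpg\n',cmd);
fclose(fid);

Executing rm -f solar3.jpg ; bash solar3.sh should then create the final image.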



4. Problems

4.1. Problem I

Download a file at [3] and use cut and grep to create a new file that only contains the column corresponding to BOUH. For example, for bou20130402vmin.min, the result should be 1440 lines, starting with

20819.61
20819.40
20819.38
...

4.2. Problem II

Download a file at [4] and use sed and grep to create a new file that only contains

2013 04 02 00 00 00.000 092     20819.61    -47.89  47705.42  52585.28
2013 04 02 00 01 00.000 092     20819.40    -47.87  47705.47  52585.24
2013 04 02 00 02 00.000 092     20819.38    -47.86  47705.47  52585.24
...

4.3. Problem III

Write a program that creates a list of files in [5] of the form

bou20130402vmin.min
bou20130403vmin.min
...
bou20130410vmin.min

Save this file as BOU.txt and push to your Bitbucket account.

4.4. Problem IV

Write a program (in any language) that creates a plot of the column BOUH for all of these files.

Any values of 99999 or larger should not be plotted (in MATLAB, if the value is NaN, it is not plotted).

4.5. Problem V

Write a program that saves the data in column BOUH for all files listed in your output to question I (and located at http://mag.gmu.edu/git-data/cds302/) in a file named BOUH_20130402-20130410.txt with the form

Year MO DY HR MN BOUH
2013 04 02 00 00 20819.61
2013 04 02 00 01 20819.40
...
2013 04 10 23 59 47689.88

Save your program as Midterm_3c.EXT and push it to your Bitbucket account.

4.6. Problem VI

Write a program that saves the numbers in BOUH_20130402-20130410.txt as doubles so that the MATLAB/Octave program

 fid = fopen('BOUH_20130402-20130410.bin');
 A = fread(fid,'double');
 fclose(fid);
 A'

displays the following (the values correspond to the first two rows of the file for 20130402 and the last row of the file for 20130410):

  ans =

   1.0e+04 *

  Columns 1 through 8

   0.201300000000000   0.201300000000000   0.201300000000000   0.000400000000000   0.000400000000000   0.000400000000000   0.000200000000000   0.000200000000000

  Columns 9 through 16

   0.001000000000000                   0                   0   0.002300000000000                   0   0.000100000000000   0.005900000000000   2.081961000000000

  Columns 17 through 18

   2.081940000000000   4.768987999999999

 

Compare the time taken to read BOUH_20130402-20130410.bin with BOUH_20130402-20130410.txt.
