Relational Databases

From ComputingForScientists

Jump to: navigation, search

Contents

  1. Introduction
  2. Definitions
  3. Basic SQL Syntax
    1. SELECT
    2. WHERE
    3. IN
    4. LIKE
    5. AND
    6. Problems
  4. Creating a SQLite DB
    1. Download and Unzip SQLite
    2. Command Line DB/Table Creation
      1. Create DB
      2. Creating TABLE
    3. Using a text file and command line
    4. Using Python
    5. Using c
    6. Problems
      1. SQL Statements
      2. Creating a SQLite DB
      3. Creating a SQL DB
  5. Creating a MySQL DB
  6. Activity I
    1. Download and Inspect
    2. Create a smaller CSV file
    3. Create a program to read a CSV file
    4. Problems
      1. 3 columns, 100 rows
      2. 3 columns, 1000 rows
      3. 3 columns, 10000 rows
      4. 3 columns, all rows
      5. All columns, 1000 rows
  7. Activity II
  8. Problems
    1. Create a SQLite DB
    2. Create a SQLite DB
    3. SQL
      1. Download and Inspect
      2. Create a smaller CSV file
      3. Create a program to read a CSV file
      4. Problems
        1. 3 columns, 100 rows
        2. 3 columns, 1000 rows
        3. 3 columns, 10000 rows
        4. 3 columns, all rows
        5. All columns, 1000 rows

1. Introduction

This section covers basic SQL syntax and the installation and use of a RDMS (SQLite).

2. Definitions

  • Database - A collection of data that is organized to allow efficient access, search, management, and update. Think of the electronic analog to a filing cabinet.
  • SQL - Structured query language. A language used for interaction with a traditional relational database
  • Relational Database - Tables of data, with each table containing related data with categories specified as column labels. Each row has a unique key (e.g., row number, SSN, student ID, etc.).
  • Schema - A description of how a database is structured (e.g., the tables in the DB, the column names and types for each table, etc.).

Example of a Relational Database


3. Basic SQL Syntax

SQL statements are case insensitive and alway end by a semi-colon. Table and column names may or may not be case sensitive (this and depends on the operating system and the SQL program (e.g., [1], even though the SQL standard specifies case sensitivity [2]).

The following are examples of basic SQL syntax on a hypothetical table.

3.1. SELECT

Select all columns from table named tablex:

SELECT * FROM tablex;

Select all rows from column named col1 from table named tablex:

SELECT col1 FROM tablex;

Select all rows from columns col1 and col2 from table named tablex:

SELECT col1,col2 FROM tablex;

3.2. WHERE

Select all rows and all columns from table named tablex provided the row has col2 > 10:

SELECT * FROM tablex WHERE col2 > 10;

Select all rows from column col1 provided that the value in of col2 > 10:

SELECT col1 FROM tablex WHERE col2 > 10;

3.3. IN

SELECT col1 FROM tablex WHERE col1 IN ('a','b','c')

3.4. LIKE

SELECT col1 FROM tablex WHERE col1 LIKE 'a%'

3.5. AND

Select all rows from column col1 provided that the value of col2 > 10 and the value of col1 < 10:

SELECT col1 FROM tablex WHERE col2` > 10 AND col1 < 10

3.6. Problems

Work through the problems at

  1. http://sqlzoo.net/wiki/SELECT_basics
  2. http://sqlzoo.net/wiki/SELECT_names
  3. http://sqlzoo.net/wiki/SELECT_from_WORLD_Tutorial

For more problems, see

You may also practice SQL commands using the Firefox add-on SQLite-Manager [3]

4. Creating a SQLite DB

See also [4] and [5]

SQL is a language used to query relational databases. In the same way that many different programs can interpret HTML and display it, many different programs can interpret SQL statements.

Here is a comparison of three RDMSs (MySQL, SQLite, and PostgreSQL): [6].

In this example, we'll use SQLite.

Note:

In what follows everything after # (or sqlite>) should be entered on the Bash (or sqlite) command line; # (or sqlite>) should be omitted.

4.1. Download and Unzip SQLite

Open a shell and download SQLite:

Linux:

# wget http://www.sqlite.org/2015/sqlite-shell-linux-x86-3080801.zip

OS-X

# curl http://www.sqlite.org/2015/sqlite-shell-osx-x86-3080801.zip > sqlite-shell-osx-x86-3080801.zip

Then unzip it:

Linux:

# unzip sqlite-shell-linux-x86-3080801.zip

OS-X

# unzip sqlite-shell-osx-x86-3080801.zip

4.2. Command Line DB/Table Creation

4.2.1. Create DB

Now start sqlite and create a database called demo

# ./sqlite3 demo.db

You should see something similar to

SQLite version 3.8.8.1 2015-01-20 16:51:25
Enter ".help" for usage hints.
sqlite> 

If you enter (In what follows, only enter the text after sqlite>.)

sqlite>.databases

you should see something similar to

seq  name             file
---  ---------------  -------------------------------
0    main             /home/weigel/git/cds302/demo.db

(Note that if you want to quit and start over, exit by typing sqlite>.exit and then # rm demo.db.)

4.2.2. Creating TABLE

See also [7] and [8].

If you enter

sqlite>.databases

you should see a database named demo. To create a table in this database, first specify column names, and the type of data [9] that will be stored in the columns:

sqlite>CREATE TABLE COMPANY(
  ID             INTEGER,
  NAME           TEXT,
  AGE            INTEGER,
  ADDRESS        TEXT,
  SALARY         REAL
);

If you enter

sqlite>.tables

you should see

COMPANY

Next, enter the following statements (omit the sqlite> prompt when entering the text below):

sqlite> INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Paul', 32, 'California', 20000.00 );
sqlite> INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Allen', 25, 'Texas', 15000.00 );
sqlite> INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'Teddy', 23, 'Norway', 20000.00 );
sqlite> INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 );
sqlite> INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (5, 'David', 27, 'Texas', 85000.00 );
sqlite> INSERT INTO COMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (6, 'Kim', 22, 'South-Hall', 45000.00 );

And display the table contents using

sqlite>SELECT * FROM COMPANY;

Finally, exit sqlite by entering

sqlite>.exit

and inspect the database file that it created:

# ls -l demo.db
You should see something similar to
-rw-r--r-- 1 user user 3072 Jan 30 13:07 demo.db

4.3. Using a text file and command line

The database file created in the previous step is in a special binary format that is not meant to be human readable. The following command creates a human readable file

# ./sqlite3 demo.db .dump > demo.sql

You can open the file demo.sql using a text editor or view it on the command line by typing

# cat demo.sql

(The command cat is used to concatenate file or to view the content of a file as was done here.)

The file demo.db is in a special format that allows SQLite to quickly read it. The file demo.sql has the same information but in a human readable form.

Suppose that you want to create a database without using typing the commands on the sqlite command line. You could create a file with a format similar to demo.sql and import it into SQLite. To do this, open a text editor and copy the following contents and save as txtdemo.sql in the same directory of demo.sql. (The following is a slightly modified version of demo.sql.)

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE TXTCOMPANY(
   ID             INTEGER,
   NAME           TEXT,
   AGE            INTEGER,
   ADDRESS        TEXT,
   SALARY         REAL
 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Paul', 32, 'California', 20000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Allen', 25, 'Texas', 15000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'Teddy', 23, 'Norway', 20000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (5, 'David', 27, 'Texas', 85000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (6, 'Kim', 22, 'South-Hall', 45000.00 );
COMMIT;
A common error is to end the CREATE TABLE block with SALARY REAL, (with a comma after REAL). The error message will be syntax error on line 3 near ")". From the perspective of sqlite, line 3 starts with CREATE TABLE and ends with the semicolon on line 9 - the ")" being referred to is on line 9 of the file. The issue is that the ")" is preceded by a comma and sqlite is expecting additional text before the ")". Also note that the CREATE TABLE command could have been written on one line:
CREATE TABLE TXTCOMPANY(ID INTEGER, NAME TEXT, AGE INTEGER, ADDRESS TEXT, SALARY REAL);

because extra spaces or newlines are ignored.

From the command line, execute (the first command deletes any database file that already exists)

# rm -f txtdemo.db 
# ./sqlite3 txtdemo.db < txtdemo.sql

This creates a database named txtdemo.db and executes the commands in the file txtdemo.sql. Start SQLite and verify that a table named TXTCOMPANY exists

# ./sqlite3 txtdemo.db
# sqlite>.tables
TXTCOMPANY
sqlite> select * from TXTCOMPANY;

4.4. Using Python

There are two ways Python could be used to create a SQLite database. First, one could write a Python program that creates a text file. Or, one could use a library for Python that allows one to directly insert data into the database. In this example we use the Python library sqlite3 to directly insert data.

Copy the following in a text editor and then enter

rm pydemo.db; python pydemo.py

on the command line.

import sqlite3
conn = sqlite3.connect('pydemo.db')
emp1 = "(1,'Paul',32,'California',20000.00)";
emp2 = "(2,'Allen',25,'Texas',15000.00)";
emp3 = "(3, 'Teddy', 23, 'Norway', 20000.00 )";
emp4 = "(4, 'Mark', 25, 'Rich-Mond ', 65000.00 )";
emp5 = "(5, 'David', 27, 'Texas', 85000.00 )";
emp6 = "(6, 'Kim', 22, 'South-Hall', 45000.00 )";
emps = [emp1,emp2,emp3,emp4,emp5,emp6]

c = conn.cursor()
c.execute('CREATE TABLE COMPANY(ID INTEGER, NAME TEXT, AGE INTEGER, ADDRESS TEXT, SALARY REAL);')

for x in emps:
        c.execute('INSERT INTO COMPANY VALUES ' + x)

conn.commit()
conn.close()

Then, verify that the database was properly created by entering

# ./sqlite3 pydemo.db
sqlite>.tables

and then

sqlite>select * from TXTCOMPANY;

4.5. Using c

See [10]

4.6. Problems

4.6.1. SQL Statements

Write a SQL statement that selects displays all rows in COMPANY for which the salary is above 20000.

Write a SQL statement that selects displays all rows in COMPANY for which the age is equal to 25.

4.6.2. Creating a SQLite DB

(Note - to install a text editor, enter sudo apt-get install gedit on the command line. When the installation is done, you can start the editor by entering gedit on the command line.)

Using a text editor, create a SQL table named WOW with two columns, named ID and ID2. Save the file as tworows.sql. The rows should be

1 2
3 4

Create the database using

# ./sqlite3 tworows.db < tworows.sql
If successful, entering
# ./sqlite3 tworows.db

and then

sqlite> select * from WOW;

should result in the following to be displayed

1|2
3|4


4.6.3. Creating a SQL DB

Use MATLAB/Octave (or any other programming language) to create a text file with the following contents (note that the ... represents 496 lines that should exist in the file that you create).

(To install Octave, enter sudo apt-get install octave)

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE NUMBERS(ID INTEGER, ID2 INTEGER);
INSERT INTO "NUMBERS" VALUES(1,500);
INSERT INTO "NUMBERS" VALUES(2,499);
...
INSERT INTO "NUMBERS" VALUES(499,2);
INSERT INTO "NUMBERS" VALUES(500,1);
COMMIT;

Hint: Build on this MATLAB/Octave program:

fid = fopen('matlabsql.sql','w');
fprintf(fid,'PRAGMA foreign_keys=OFF;\n');
fprintf(fid,'BEGIN TRANSACTION;\n CREATE TABLE NUMBERS(ID INTEGER,ID2 INTEGER);');
fprintf(fid,'INSERT INTO "NUMBERS" VALUES(%d,%d);\n',1,500);
fprintf(fid,'INSERT INTO "NUMBERS" VALUES(%d,%d);\n',2,499);
fprintf(fid,'COMMIT;\n');
fclose(fid);

Read the database into SQLite and write a query that reports the number of rows for which ID is equal to ID2.

5. Creating a MySQL DB

Save the following as demo.mysql and execute

mysql -u root -pPASSWORD < demo.mysql

where PASSWORD is the password set for the root account during installation of mysql. Not that there is not a space between p and P.

To verify that the database and table was created properly, execute

mysql -u root -pPASSWORD  -e "USE demo;SHOW tables;SELECT * FROM TXTCOMPANY;"

(The -e argument executes the command that follows it in mysql as if you had typed it on the mysql> command line.)

Or, start mysql

# mysql -u root -pPASSWORD

and enter

mysql> USE demo; SHOW tables;SELECT * FROM TXTCOMPANY;
DROP DATABASE IF EXISTS	demo;
CREATE DATABASE demo;
USE demo;
CREATE TABLE TXTCOMPANY(
   ID             INTEGER,
   NAME           TEXT,
   AGE            INTEGER,
   ADDRESS        TEXT,
   SALARY         REAL
 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (1, 'Paul', 32, 'California', 20000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (2, 'Allen', 25, 'Texas', 15000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (3, 'Teddy', 23, 'Norway', 20000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (4, 'Mark', 25, 'Rich-Mond ', 65000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (5, 'David', 27, 'Texas', 85000.00 );
INSERT INTO TXTCOMPANY (ID,NAME,AGE,ADDRESS,SALARY) VALUES (6, 'Kim', 22, 'South-Hall', 45000.00 );

6. Activity I

In this activity, you will take an existing database in CSV format (comma separated values) and convert it to a SQLite database. Next, you will perform searches on your database and compare with the results of an existing search interface that operates on the same data.

The entire contents of the database at [13] have been downloaded and placed at http://mag.gmu.edu/astro.csv.gz. The file astro.csv.gz was created by selecting All in all of the selection menus in the section "Additional Constraints". (Don't do this - it will take a very long time!)

6.1. Download and Inspect

Use curl or wget to download astro.csv.gz

curl http://mag.gmu.edu/tmp/astro.csv.gz > astro.csv.gz

or

wget http://mag.gmu.edu/tmp/astro.csv.gz

Note that by default, wget creates a file named astro.csv.gz where curl requires you to send the output to a file with a name of your choosing.

Inspect the file

ls -lh astro.csv.gz

Unzip the file (this will take a while)

gunzip astro.csv.gz

and inspect the size of the unzipped file

ls -lh astro.csv

and count how many lines the file contains

wc astro.csv

and make sure the file looks like what is expected

head -2 astro.csv

(The command head -2 displays the first 2 lines of the file. (Code tail -2 would show the last two lines.)

6.2. Create a smaller CSV file

The unzipped file is 4.7 GB and we want to convert this file to a SQLite database. There are several ways to do this (covered in RDMS#Creating_a_SQL_DB):

  1. By entering values one-by-one at the sqlite> prompt;
  2. By creating a text file manually with one insert command for each line of the file to create the database (6,854,514 according to wc);
  3. By using a program to create a text file with SQL commands to create the database; and
  4. By using a program to create the database without creating an intermediate text file.

We are going to use approach 3. and 4.

Because this file is so large, evertime we work with it a command takes a long time to execute (wc took 30 seconds). The first thing we'll do is create a smaller version of the database.

head -10 astro.csv > astro10.csv

6.3. Create a program to read a CSV file

Save the following to a program named a3a.py in the same directory as astro10.csv

import csv
with open('astro10.csv','r') as csvfile:
     contents = csv.reader(csvfile, delimiter=',')
     for row in contents:
     	 print row

execute on the command line

python a3a.py

and inspect the result. Each time the print command is called, it displays a line, but notice that each field is quoted. Each row is stored as a list in Python. To print a only a certain element of the list, replace

print row

with

print row[0]

or

print 'The first field in this row is ' + row[0]

Now we have enough information about Python to create a sequence of commands to create a table and to populate the table with data.

import csv

print 'PRAGMA foreign_keys=OFF;'
print 'BEGIN TRANSACTION;'
print 'CREATE TABLE ASTRO (Preview text, Collection text, Obs_ID text);'

with open('astro10.csv','r') as csvfile:
     contents = csv.reader(csvfile, delimiter=',')
     for row in contents:
         print "INSERT INTO ASTRO (Preview, Collection, Obs_ID) VALUES ('" + row[0] + "','" + row[1] + "','" + row[2] + "');"

print 'COMMIT;'

If the above is saved as a3b.py, executing

python a3b.py

should show a series of SQL commands. Dump these commands to a file using

python a3b.py > astro10.sql

Check the file

cat astro10.sql

and then import into sqlite3

rm -f astro10.db; ./sqlite3 astro10.db < astro10.sql

Recall that ./sqlite3 astro10.db is used to create a database named astro10.db and that < astro10.sql causes the commands in the file to be processed as if they were hand-entered on the command line. (The rm command is there because if the file astro10.db already exists, the sqlite3 command will give an error.)

Note that you can time how long commands take using the command time. For example, try

rm -f astro10.db;
time ./sqlite3 astro10.db < astro10.sql

Finally, make sure things worked

# ./sqlite3 astro10.db
sqlite> SELECT * from ASTRO;

you should see

Preview|Collection|Obs. ID
caom:CFHT/792179|CFHT|792179
caom:CFHT/792179|CFHT|792179
caom:CFHT/688336|CFHT|688336
caom:CFHT/688351|CFHT|688351
caom:CFHT/792250|CFHT|792250
caom:CFHT/792250|CFHT|792250
caom:CFHT/792254|CFHT|792254
caom:CFHT/688506|CFHT|688506
caom:CFHT/688702|CFHT|688702

6.4. Problems

6.4.1. 3 columns, 100 rows

Follow the steps used to create a database containing the first 10 rows of astro.csv so that the first 100 rows of astro.csv are used.

Use time to determine how long it took to create the database (to process the .sql file).

6.4.2. 3 columns, 1000 rows

Same as previous problem except with 1000 rows.

6.4.3. 3 columns, 10000 rows

Same as previous problem except with 10000 rows.

Now predict how long it will take to create the full database with ~7 million rows.

6.4.4. 3 columns, all rows

The first task is to verify the database you created is consistent with that of [14].

Think of queries that you can run on your 3 column database for comparison with what is shown at [15].

For example, the web page indicates that there are a total of 20 unique collections (CFHT, HST, etc.). Try to write a SQL query that counts the number of unique collection names in the Collection column. (Such queries were not covered in the notes. You'll have to search the web for examples.)

Challenge

Use the Bash commands grep, wc, cut, and unique to count the number of unique collection names in the Collection column of astro.csv.

What are other queries that you could run to check for consistency?

6.4.5. All columns, 1000 rows

Thus far, we only have a database with three columns. Modify the Python code so that it creates a database with columns corresponding to the first line of astro.csv. Note that some characters are not allowed in column names (period, space, parenthesis, single quote, double quote, etc.). You will have to modify the column names as needed (for example, as was done in the example where Obs_ID was used instead of Obs. ID).

Use time to determine how long it took to create the database.

Challenge

Same as previous problem except with 10000 rows. You will get an error. Figure out why and fix the problem.

7. Activity II

Copy and paste the text below into a file named activityIIa.sh. On the command line in the same directory as this file, execute

# bash activityIIa.sh

which creates a SQLite database named activityII.db.

#!/bin/bash

# Data are described at
# http://data.worldbank.org/indicator/NY.GDP.MKTP.CD?page=5

# Data file ./data/ny.gdp.mktp.cd_Indicator_en_csv_v2.csv
# was created by executing the following commands:
#  mkdir data;
#  curl "http://api.worldbank.org/v2/en/indicator/ny.gdp.mktp.cd?downloadformat=csv" > ./data/ny.gdp.mktp.cd_Indicator_en_csv
_v2.zip 
#  cd data; unzip ny.gdp.mktp.cd_Indicator_en_csv_v2.zip 

# A copy of this zip file is available at
# http://mag.gmu.edu/git-data/cds302/ny.gdp.mktp.cd_Indicator_en_csv_v2.zip

# If file does not exist, download and expand zip file it is in
if [ ! -f data/ny.gdp.mktp.cd_Indicator_en_csv_v2.csv ]; then
    mkdir data
    curl http://mag.gmu.edu/git-data/cds302/ny.gdp.mktp.cd_Indicator_en_csv_v2.zip > data/ny.gdp.mktp.cd_Indicator_en_csv_v2.
zip    
    cd data; unzip ny.gdp.mktp.cd_Indicator_en_csv_v2.zip
fi

# Place first two lines for DB creation in activityII-head.sql
echo "BEGIN TRANSACTION;" > activityII-head.sql
echo "CREATE TABLE NY_GDP_MKTP_CD_Indicator (CountryName TEXT,CountryCode TEXT,IndicatorName TEXT,IndicatorCode TEXT,YEAR1960
 NUMBER,YEAR1961 NUMBER,YEAR1962 NUMBER,YEAR1963 NUMBER,YEAR1964 NUMBER,YEAR1965 NUMBER,YEAR1966 NUMBER,YEAR1967 NUMBER,YEAR1
968 NUMBER,YEAR1969 NUMBER,YEAR1970 NUMBER,YEAR1971 NUMBER,YEAR1972 NUMBER,YEAR1973 NUMBER,YEAR1974 NUMBER,YEAR1975 NUMBER,YE
AR1976 NUMBER,YEAR1977 NUMBER,YEAR1978 NUMBER,YEAR1979 NUMBER,YEAR1980 NUMBER,YEAR1981 NUMBER,YEAR1982 NUMBER,YEAR1983 NUMBER
,YEAR1984 NUMBER,YEAR1985 NUMBER,YEAR1986 NUMBER,YEAR198 NUMBER,YEAR1988 NUMBER,YEAR1989 NUMBER,YEAR1990 NUMBER,YEAR1991 NUMB
ER,YEAR1992 NUMBER,YEAR1993 NUMBER,YEAR1994 NUMBER,YEAR1995 NUMBER,YEAR1996 NUMBER,YEAR1997 NUMBER,YEAR1998 NUMBER,YEAR1999 N
UMBER,YEAR2000 NUMBER,YEAR2001 NUMBER,YEAR2002 NUMBER,YEAR2003 NUMBER,YEAR2004 NUMBER,YEAR2005 NUMBER,YEAR2006 NUMBER,YEAR200
7 NUMBER,YEAR2008 NUMBER,YEAR2009 NUMBER,YEAR2010 NUMBER,YEAR2011 NUMBER,YEAR2012 NUMBER,YEAR2013 NUMBER,YEAR2014);" >> activ
ityII-head.sql

# A string that will be referenced later.
PRE="INSERT INTO NY_GDP_MKTP_CD_Indicator (CountryName,CountryCode,IndicatorName,IndicatorCode,YEAR1960,YEAR1961,YEAR1962,YEA
R1963,YEAR1964,YEAR1965,YEAR1966,YEAR1967,YEAR1968,YEAR1969,YEAR1970,YEAR1971,YEAR1972,YEAR1973,YEAR1974,YEAR1975,YEAR1976,YE
AR1977,YEAR1978,YEAR1979,YEAR1980,YEAR1981,YEAR1982,YEAR1983,YEAR1984,YEAR1985,YEAR1986,YEAR198,YEAR1988,YEAR1989,YEAR1990,YE
AR1991,YEAR1992,YEAR1993,YEAR1994,YEAR1995,YEAR1996,YEAR1997,YEAR1998,YEAR1999,YEAR2000,YEAR2001,YEAR2002,YEAR2003,YEAR2004,Y
EAR2005,YEAR2006,YEAR2007,YEAR2008,YEAR2009,YEAR2010,YEAR2011,YEAR2012,YEAR2013,YEAR2014) VALUES ("

# Relace single quotes (occur in country names) with two single quotes
# (escape for ' is ''):
# perl -pe "s/'/''/g" 

# Replace double quotes with single quotes:
# perl -pe "s/\"/'/g"

# Prefix existing line with string PRE
# perl -pe "s/^(.*)/$PRE\1/g"

# Replace comma at end of line with ");\n"
# perl -pe 's/,\r\n/);\n/'

# Replace '' with NULL
# perl -pe "s/''/NULL/g"

# Send the out put to a file named activityII-tail.sql
# > activityII-tail.sql

tail -n 248 ./data/ny.gdp.mktp.cd_Indicator_en_csv_v2.csv | perl -pe "s/'/''/g" | perl -pe "s/\"/'/g" | perl -pe "s/^(.*)/$PR
E\1/g" | perl -pe 's/,\r\n/);\n/' | perl -pe "s/''/NULL/g" > activityII-tail.sql

# Create string "COMMIT;" and send to cat.  activityII-head.sql and activityII.tail are
# echo "COMMIT;" | cat activityII-head.sql activityII-tail.sql -

# Remove ' when it is preceeded or followed by 0-9.
# perl -pe "s/([0-9])'|'([0-9])/\1/g" 

# Send to file activityII.sql
# > activityII.sql
echo "COMMIT;" | cat activityII-head.sql activityII-tail.sql - | perl -pe "s/([0-9])'|'([0-9])/\1/g" > activityII.sql

# Remove existing db file if it exists.
rm -f activityII.db;

# Create activityII.db
sqlite3 activityII.db < activityII.sql
rm -f activityII-head.sql activityII-tail.sql

8. Problems

8.1. Create a SQLite DB

Covered in RDMS

Create a SQLite DB named market.db using the commands [17]. Information about this data is given at RDMS#Activity_II.

Execute queries on this DB that results in a display of

  • All rows and all columns of the only table in this database
  • All rows and all columns for Ireland and Iceland
  • The number of rows for which the value for the GDP in 2011 was not NULL
  • The country name and GDP in 2013 for rows where the GDP in 2013 was not NULL

8.2. Create a SQLite DB

The file [18] is the content of the XLSX file [19] mentioned at [20]. (The CSV file was created by opening the XLSX file in a spreadsheet program and then exporting as CSV.)

  • Use any approach to create a SQLite database containing this information.
  • Write a query that counts the total number of medals won by athletes from the United States.

8.3. SQL

In this problem, you will take an existing database in CSV format (comma separated values) and convert it to a SQLite database. Next, you will perform searches on your database and compare with the results of an existing search interface that operates on the same data.

The entire contents of the database at [22] have been downloaded and placed at http://mag.gmu.edu/astro.csv.gz. The file astro.csv.gz was created by selecting All in all of the selection menus in the section "Additional Constraints". (Don't do this - it will take a very long time!)

# curl http://mag.gmu.edu/tmp/astro.csv.gz > astro.csv.gz
# gunzip astro.csv.gz
# head -10 astro.csv > astro10.csv
# python HW1a.py (where astro.py is this file)
# sqlite3 astro.db < astro.sql

8.3.1. Download and Inspect

Use curl or wget to download astro.csv.gz

curl http://mag.gmu.edu/tmp/astro.csv.gz > astro.csv.gz

or

wget http://mag.gmu.edu/tmp/astro.csv.gz

Note that by default, wget creates a file named astro.csv.gz where curl requires you to send the output to a file with a name of your choosing.

Inspect the file

ls -lh astro.csv.gz

Unzip the file (this will take a while)

gunzip astro.csv.gz

and inspect the size of the unzipped file

ls -lh astro.csv

and count how many lines the file contains

wc astro.csv

and make sure the file looks like what is expected

head -2 astro.csv

(The command head -2 displays the first 2 lines of the file. (Code tail -2 would show the last two lines.)

8.3.2. Create a smaller CSV file

The unzipped file is 4.7 GB and we want to convert this file to a SQLite database. There are several ways to do this (covered in RDMS#Creating_a_SQL_DB):

  1. By entering values one-by-one at the sqlite> prompt;
  2. By creating a text file manually with one insert command for each line of the file to create the database (6,854,514 according to wc);
  3. By using a program to create a text file with SQL commands to create the database; and
  4. By using a program to create the database without creating an intermediate text file.

We are going to use approach 3. and 4.

Because this file is so large, evertime we work with it a command takes a long time to execute (wc took 30 seconds). The first thing we'll do is create a smaller version of the database.

head -10 astro.csv > astro10.csv

8.3.3. Create a program to read a CSV file

Save the following to a program named a3a.py in the same directory as astro10.csv

import csv
with open('astro10.csv','r') as csvfile:
     contents = csv.reader(csvfile, delimiter=',')
     for row in contents:
     	 print row

execute on the command line

python a3a.py

and inspect the result. Each time the print command is called, it displays a line, but notice that each field is quoted. Each row is stored as a list in Python. To print a only a certain element of the list, replace

print row

with

print row[0]

or

print 'The first field in this row is ' + row[0]

Now we have enough information about Python to create a sequence of commands to create a table and to populate the table with data.

import csv

print 'PRAGMA foreign_keys=OFF;'
print 'BEGIN TRANSACTION;'
print 'CREATE TABLE ASTRO (Preview text, Collection text, Obs_ID text);'

with open('astro10.csv','r') as csvfile:
     contents = csv.reader(csvfile, delimiter=',')
     for row in contents:
         print "INSERT INTO ASTRO (Preview, Collection, Obs_ID) VALUES ('" + row[0] + "','" + row[1] + "','" + row[2] + "');"

print 'COMMIT;'

If the above is saved as a3b.py, executing

python a3b.py

should show a series of SQL commands. Dump these commands to a file using

python a3b.py > astro10.sql

Check the file

cat astro10.sql

and then import into sqlite3

rm -f astro10.db; ./sqlite3 astro10.db < astro10.sql

Recall that ./sqlite3 astro10.db is used to create a database named astro10.db and that < astro10.sql causes the commands in the file to be processed as if they were hand-entered on the command line. (The rm command is there because if the file astro10.db already exists, the sqlite3 command will give an error.)

Note that you can time how long commands take using the command time. For example, try

rm -f astro10.db;
time ./sqlite3 astro10.db < astro10.sql

Finally, make sure things worked

# ./sqlite3 astro10.db
sqlite> SELECT * from ASTRO;

you should see

Preview|Collection|Obs. ID
caom:CFHT/792179|CFHT|792179
caom:CFHT/792179|CFHT|792179
caom:CFHT/688336|CFHT|688336
caom:CFHT/688351|CFHT|688351
caom:CFHT/792250|CFHT|792250
caom:CFHT/792250|CFHT|792250
caom:CFHT/792254|CFHT|792254
caom:CFHT/688506|CFHT|688506
caom:CFHT/688702|CFHT|688702

8.3.4. Problems

Solutions: See RDMS#Activity

8.3.4.1. 3 columns, 100 rows

Follow the steps used to create a database containing the first 10 rows of astro.csv so that the first 100 rows of astro.csv are used.

Use time to determine how long it took to create the database (to process the .sql file).

8.3.4.2. 3 columns, 1000 rows

Same as previous problem except with 1000 rows.

8.3.4.3. 3 columns, 10000 rows

Same as previous problem except with 10000 rows.

Now predict how long it will take to create the full database with ~7 million rows.

8.3.4.4. 3 columns, all rows

The first task is to verify the database you created is consistent with that of [23].

Think of queries that you can run on your 3 column database for comparison with what is shown at [24].

For example, the web page indicates that there are a total of 20 unique collections (CFHT, HST, etc.). Try to write a SQL query that counts the number of unique collection names in the Collection column. (Such queries were not covered in the notes. You'll have to search the web for examples.)

Challenge

Use the Bash commands grep, wc, cut, and unique to count the number of unique collection names in the Collection column of astro.csv.

What are other queries that you could run to check for consistency?

8.3.4.5. All columns, 1000 rows

Thus far, we only have a database with three columns. Modify the Python code so that it creates a database with columns corresponding to the first line of astro.csv. Note that some characters are not allowed in column names (period, space, parenthesis, single quote, double quote, etc.). You will have to modify the column names as needed (for example, as was done in the example where Obs_ID was used instead of Obs. ID).

Use time to determine how long it took to create the database.

Challenge

Same as previous problem except with 10000 rows. You will get an error. Figure out why and fix the problem.

Personal tools