# 1. Definitions

• Database - A collection of data that is organized to allow efficient access, search, management, and update. Think of the electronic analog to a filing cabinet.
• Data in database are usually related (e.g., collection of temperature measurements).
• Access, search, management, and update efficiency are interdependent. Efficient update may mean inefficient search.
• DBMS - Database management system. MySQL, PostgreSQL, etc. Provide internal structure to the filing cabinet and allow:
• Management (organization of the data)
• Update (addition, modification, and removal)
• Retrieval (extraction of information by query)
• Administration (logging of queries, security, verification, deletion recovery)
• DBQL - Database query language. A language for making requests for information from the filing cabinet.
• SQL - Structured query language. A language used for interaction with a traditional relational database (putting data in and getting data out). Example: http://sqlfiddle.com/#!2/a2581/1
• NoSQL - Not only SQL. A general terms for a non-traditional database.
• Big Data - Information that can not be processed and/or handled and/or managed using common database software and tools.
• Flat files - An ASCII encoded file with very little structure. For example, a file with a header and one or more columns, each with the same number of rows. "Flat" means "little structure".
• ASCII files - An ASCII encoded file with data (typically created with printf() or equivalent).
• Binary files - A file with numbers that cannot be decoded with an ASCII reader (typically created with fwrite() or equivalent).
• Database model - The logical structure of a database.

# 2. Database and DBMS Definition

(From [1])

• A DB is a shared, integrated computer structure that stores a collection of:
• End-user data - Raw facts of interest to end user
• Metadata: Data about data, which the end-user data are integrated and managed
• Describe data characteristics and relationships
• Database management system (DBMS)
• Collection of programs
• Manages the database structure

Role of DBMS

• Intermediary between the user and the database
• Enables data to be shared
• Presents the end user with an integrated view of the data
• Receives and translates application requests into operations required to fulfill the requests
• Hides database’s internal complexity from the application programs and users

# 3. Database Types

(From [2])

• Single-user database: Supports one user at a time
• Desktop database: Runs on PC
• Multiuser database: Supports multiple users at the same time
• Workgroup databases: Supports a small number of users or a specific department
• Enterprise database: Supports many users across many departments
• Centralized database: Data is located at a single site
• Distributed database: Data is distributed across different sites
• Cloud database: Created and maintained using cloud data services that provide defined performance measures for the database
• General-purpose databases: Contains a wide variety of data used in multiple disciplines
• Discipline-specific databases: Contains data focused on specific subject areas
• Operational database: Designed to support a company’s day-to-day operations
• Analytical database: Stores historical data and business metrics used exclusively for tactical or strategic decision making
• Data warehouse: Stores data in a format optimized for decision support
• Unstructured data: It exists in their original state
• Structured data: It results from formatting
• Structure is applied based on type of processing to be performed
• Semistructured data: Processed to some extent
• Extensible Markup Language (XML)
• Represents data elements in textual format

# 4. Database Design

(From [3])

• Focuses on the design of the database structure that will be used to store and manage end-user data
• Well-designed database
• Facilitates data management
• Generates accurate and valuable information
• Poorly designed database causes difficult-to-trace errors

# 5. File System DB

Commonly used in science applications. A set of directories and files with a sensible naming convention. Data and metadata files are in same format and have the same structure.

Typical Examples:

From Why don't scientists use databases? [4]

Most scientists seem to avoid using databases. In this paper we examine twelve of the principal reasons given by scientists for preferring flat files and/or more structured files (such as NetCDF or HDF5). We then explain what database extension tools we have available for addressing some of these concerns, and what further changes are necessary in database management technology and practices in order to address the remaining concerns.

Twelve Reasons We Hear From Scientists For Not Using Databases

1. There are no benefits to using a database for the sort of work I do.
2. File systems are familiar to me; databases aren’t.
3. The data that I have collected in the past is all in files.
4. The new data I work with comes to me in the form of files.
5. The programs that I use work with files, not databases.
6. Files are easy to keep track of.
7. Files make handling schema evolution, versioning, and history easier.
8. Database data types are business - not science - oriented.
9. Database Management Systems are too expensive.
10. Databases are too complicated.
11. Complex interrelationships in a file (hierarchies, etc.) may be more difficult to represent in a database.
12. DBMS performance is poor.

• Management (organization of the data) - straightforward, but easy to invent your own conventions
• Update (addition, modification, and removal) - straightforward
• Retrieval (extraction of information by query) - usually requires special code to first ingest data into program and then query it
• Administration (logging of queries, security, verification, deletion recovery) - straightforward, but no standard way of doing it

# 6. Non-File System DB

Example from [5]

It would be difficult, if not impossible, to produce an alphabetical listing of employees based on their last names.

• To determine how many employees are certified in Basic Database Manipulation, you would need a program that counts the number of those certifications recorded in Skill1 and places it in a variable. Then the count of those certifications in Skill2 could be calculated and added to the variable. Finally, the count of those certifications in Skill3 could be calculated and added to the variable to produce the total.
• If you redundantly store the name of a skill with each employee who is certified in that skill, you run the risk of spelling the name differently for different employees. For example, the skill Basic Database Manipulation is also entered as Basic DB Manipulation for at least one employee in Figure 1.3, which makes it difficult to get an accurate count of employees who have the certification.
• The structure of the database will have to be changed by adding more columns to the table when an employee is certified in a fourth skill. It will have to be modified again if an employee is certified in a fifth skill.

Produce an alphabetical listing of employees by last name:

SELECT * FROM EMPLOYEE ORDER BY EMPLOYEE_LNAME;


Determine how many employees are certified in Basic Database Manipulation:

SELECT Count(*)
FROM SKILL JOIN CERTIFIED ON SKILL.SKILL_ID = CERTIFIED.SKILL_ID
WHERE SKILL_NAME = ‘Basic Database Manipulation’;


# 7. Discussion

• One project that we will do in this class is to store scientific data in a file system database and a relational database and then compare and contrast advantages and disadvantages of each.
• In this course, I will cover usage of "traditional databases", but priority will be given on usage of typical scientific databases.

• Database Systems [6]
• Why don't scientists use databases? [7]
• Scientific Data Management in the Coming Decade [8]
• Developing a Scientific Database [9]

Reports:

# 9. Problems

## 9.1. Information Definition

From Section 1.2 of [10]:

• "Data are raw facts."
• "Information is the result of processing raw data to reveal its meaning."

Find sentences from scientific literature that use the words data an information. Based on context, what is the implicit definition that the author intendend?

Most students found usage of "data" and "information" consistent with the text.

As an example how sometimes "data" and "information" are not clearly distinct, consider two dictionary definitions of "datum":

• "a single piece of information, as a fact, statistic, or code; an item of data [11]

## 9.2. Filing Cabinet Organization

You have a stack of 10,000 scientific papers, several filing cabinets, hanging folders, folders, blank labels, and a pen. How would you organize the papers to allow for efficient

1. Search (finding a paper by a certain author or topic)?
2. Management (how easy is it to set up and re-organize)?

Describe how the papers should be organized. Decribe how design decisions will affect other items.

Example:

My approach is to maximize efficiency of management; the papers should be stacked in the filing cabinets in any order without use of labels and folders. Papers should be added to bottom drawers and from left-to right cabinets so that when an update is needed, I have an idea of what drawer to look in for space. This system is easy to manage because very little thought is needed to set up the databases. This system is easy to re-organize because the content of one drawer can be exchanged with the content of another with very little thought. This database is very difficult to search. On average, I would need to look through 5,000 papers to find the one I want. This database is fairly easy to update - the only complication is that I need to find a drawer with available space. I can do this using a binary search algorithm.

## 9.3. Directory Structure Organization

You have 10,000 scientific papers in PDF format and an operating system with a text editor, symbolic links, and a file system directory browser. How would you organize the papers to allow for efficient

1. Search (finding a paper by a certain author or topic)?
2. Management (how easy is it to set up and re-organize)?

Describe how the papers should be organized to maximize effiency for each of the three items above. Describe how design decisions will affect other items.

Example:

1. To make search for a paper by author or topic efficient, I would create a text file with three columns "Authors", "Topics", and "File", e.g.,

Myfiles.txt

Smith and Smyth, solar wind; rainbows, C:\\Users\Weigel\PDFs\ja000sdf.pdf
Jones and Bones, cravings; skeletons, C:\\Users\Weigel\PDFs\sp1sdfsd_zz.pdf
...


When I needed to find a file, I would open the text file and search for a name or a topic and then open the associated file. To make opening the file possible with a click, I would create Myfiles.htm, which had an HTML table and hyperlinks that when clicked would open the PDF.

2. & 3. The above method allows for efficient management and update. Because all of the files are located in a single directory , re-organization is not needed. To add new papers, one only needs to add an entry in the text file.