layout: "../layouts/BlogPost.astro" title: "Rice course - intro to database" slug: rice-course-intro-to-database description: "" added: "Oct 21 2018"
Early Open Source Databases
SQLite is a software library that provides a relational database management system. If you have a Mac, sqlite3 should be available on there. https://github.com/kexiZeroing/node-sqlite-app
The Relational Model is usually denoted using LIKES (DRINKER, COFFEE)
<img alt="relation model" src="https://raw.gitmirror.com/kexiZeroing/blog-images/e267d5c748689a64a58514b9c6776a4499eff478/BD61-413F-89E1-3A76798F04B1.png" width="500">
Keys (If the key for two tuples are the same, they must be the same tuple)
Super key: An attribute or set of attributes that uniquely defines a tuple within a relation. However, a superkey may contain additional attributes that are not necessary for unique identification.
The foreign key must be an attribute or set of attributes that are uniquely identify a record in another table, and that combination of attribute values must be present in the other table.
Attribute Constraints
Table Level Constraints
PRIMARY KEY Constraint
Relational Algebra
<img alt="algebra" src="https://raw.gitmirror.com/kexiZeroing/blog-images/main/DC50FBCB-B408-4495-A8D9-99383F8B05E9.png" width="500">
Left / Right Outer Join
- R LEFT OUTER JOIN S ON
R.<att> = S.<att>
- LEFT JOIN and LEFT OUTER JOIN are the same
- 包括左表中的所有行以及右表中匹配的行,如果左表的某行在右表中没有匹配行,assigns NULLs
SQL
-- Who goes to a cafe serving Cold Brew?
SELECT DISTINCT f.DRINKER
FROM FREQUENTS AS f, SERVES AS s
WHERE f.CAFE = s.CAFE AND s.COFFEE = 'Cold Brew'
WHERE Clause
- <Attribute> = <value>
- <Attribute> BETWEEEN <value1> AND <value2>
- <Attribute> IN ([value1], [value2], ?)
- <Attribute> LIKE 'SST%'
- <Attribute> LIKE 'SST_'
- <Attribute> IS NULL
- <Attribute> IS NOT NULL
- Logical combinations with AND and OR
- We can also have a subquery in the WHERE clause (EXISTS / NOT EXISTS, IN / NOT IN )
/*
% represents zero, one or multiple characters
- represents a single character
*/
-- Who goes to a cafe serving Cold Brew?
SELECT DISTINCT f.DRINKER
FROM FREQUENTS AS f
WHERE EXISTS (
SELECT s.CAFE
FROM SERVES AS s
WHERE f.CAFE = s.CAFE AND s.COFFEE = 'Cold Brew'
)
/*
f 的每一行带入子查询,检查返回结果是否为 true
EXISTS 用于检查子查询是否至少会返回一行数据,该子查询实际上并不返回任何数据,而是返回 True/False
*/
Aggregations and Grouping
In database management, an aggregate function is a function where the values of multiple rows are grouped together to form a single value of more significant meaning.
SELECT r.COFFEE, AVG (r.SCORE)
FROM RATES AS r
GROUP BY r.COFFEE
r.COFFEE
, then the aggregate runs over each subgroup independently.SELECT r.COFFEE, AVG (r.SCORE) AS AVG_RATING
FROM RATES AS r
GROUP BY r.COFFEE
HAVING COUNT(*) >= 3
HAVING
is used to filter groups.HAVING
is applied after the group by phase whereas WHERE
is applied before the group by phase.What about NULL?
COUNT(1)
or COUNT(*)
will count every row.COUNT(<attribute>)
will count NOT NULL values.GROUP BY
includes a row for NULL.
count(1)
是在统计个数,可以想成表中有这么一个字段,这个字段就是固定值1,计算一共有多少个1。同理,count(2)
,count('x')
都是可以的。
CREATE TABLE
CREATE TABLE Student (
netId varchar(15) NOT NULL,
lastName varchar(100) NOT NULL,
firstName varchar(100) NOT NULL,
dateofbirth DATE NULL,
PRIMARY KEY (netId)
);
DROP TABLE
DROP TABLE [IF EXISTS] <tableName>;
ALTER TABLE
Populating Tables
INSERT INTO Room (classroomId, building, abbrev, room) VALUES
('DCH1070', 'Duncan Hall', 'DCH', '1070'),
('DCH1055', 'Duncan Hall', 'DCH', '1055'),
('HRZ211', 'Herzstein Hall', 'HRZ', '211'),
null
, which is the default values for those columns.Update and Delete
UPDATE <tableName> SET <attribute> = <value> WHERE <condition>;
DELETE FROM <tableName> WHERE <condition>;
Entity-Relationship Diagram (ERD)
<img alt="ERD" src="https://raw.gitmirror.com/kexiZeroing/blog-images/main/E2C8-43B6-8005-2C212621341C.png" width="500">
(a, b)
表示参与关系的最少数量和最多数量From ERD to Tables
1-1
: Foreign key for one of the entities to the other1-M
: Foreign key in M side referencing the 1 sideM-M
relationshipData Normalization
1st normal form
<img alt="1st normal form" src="https://raw.gitmirror.com/kexiZeroing/blog-images/main/E31B-43F2-867D-F010EF70B182.png" width="450">
2nd normal form
<img alt="2nd normal form" src="https://raw.gitmirror.com/kexiZeroing/blog-images/main/CC97-41DF-96DC-23D4B859F476.png" width="450">
3rd normal form
<img alt="3rd normal form" src="https://raw.gitmirror.com/kexiZeroing/blog-images/main/A0D7-4968-A52C-10240A6C7230.png" width="500">
BCNF
Thoughts on normalization
- Don't be afraid of having a lot of tables with just a few columns
- 3rd Normal Form is usually good enough
- Consider denormalization for performance reasons (introduces more tables which leads to more joins)
SQL Injection Attack
-- UserId: 105 OR 1=1
SELECT * FROM Users WHERE UserId = 105 OR 1=1;
-- UserId: 105; DROP TABLE Suppliers
SELECT * FROM Users WHERE UserId = 105; DROP TABLE Suppliers;
-- uName: " or ""="
-- uPass: " or ""="
-- sql = 'SELECT * FROM Users WHERE Name ="' + uName + '" AND Pass ="' + uPass + '"'
SELECT * FROM Users WHERE Name ="" or ""="" AND Pass ="" or ""=""
Indexing
B+ Tree is the most common type of storage for RDBMSs
An index is a structure like the one at the end of a book. At the end of a book you see several pages with words and where you can find those words. Those pages are an index and the same is the case for a database. The index contains key and their locations. It helps you finding the location of rows.
Transaction
Group of operations performed together, follow the ACID properties
START TRANSACTION;
SELECT balance FROM checking WHERE customer_id = 1;
UPDATE checking SET balance = balance - 100 WHERE customer_id = 1;
UPDATE savings SET balance = balance + 100 WHERE customer_id = 1;
COMMIT;
允许在一个事务中的操作语句会被其他事务的语句隔离开,比如事务 A 运行到第 3 行之后,第 4 行之前,此时事务 B 去查询 checking 余额时,仍然能够看到在事务 A 中减去 100 之前的余额,因为事务 A 和 B 是彼此隔离的。在事务 A 提交之前,事务 B 观察不到数据的改变。
Isolation Level
Dirty data: data written by a transaction that has not been committed
SET TRANSACTION ISOLATION LEVEL
How can we deal with failures? --Logs
undo 日志用于记录事务开始前的状态,用于事务失败时的回滚操作;redo 日志记录事务执行后的状态,用来恢复未写入 data file 的已成功事务更新的数据。例如某一事务 T1,其对数据 X 进行修改,设 X 的原值是 5,修改后的值为 15,那么 Undo 日志为 <T1, X, 5>
,Redo 日志为 <T1, X, 15>
.
redo 日志
当数据库对数据做修改的时候,需要把数据所在的 page 从磁盘读到 buffer pool 中,然后在 buffer pool 中进行修改,这时候 buffer 中的数据页就与磁盘上的数据页内容不一致(脏页),此时的数据尚在缓存中还没有持久化,如果 DB 出现故障就会导致丢失数据,因此,Innodb 会在 buffer pool 中的 data page 变更结束后,将相应的修改记录到 redo log 里进行持久化。redo log 记录某数据块被修改后的值,可以用来恢复未刷新到磁盘上的数据,并且总是日志先行,它先于脏页刷新到磁盘上。(在持久化数据文件前,保证之前的 redo 日志已经写到磁盘 flush the log to disk)
NoSQL
To address the need for storage of non-relational data
SQL -> MongoDB
<img alt="mongodb" src="https://raw.gitmirror.com/kexiZeroing/blog-images/main/6529-4D8C-8A31-CCA28EB64614.png" width="600">