Common SQL Functions for Data Scientist

Yue Weng Mak
Level Up Coding
Published in
9 min readApr 17, 2020

--

SQL functions that every data scientist should know.

Photo by Tobias Fischer on Unsplash

What is SQL?

Structured Query Language (SQL) allows you to query from a database. You can perform different actions on the database including:

  • Adding a record into the database
  • Updating a record into the database
  • Removing a record into the database
  • Creating tables

What is a Database?

A database is a set of data that is stored and easily retrieved. There are different kinds of databases, such as Relational Databases and NoSQL databases. In this example, I will be using Relational Databases. Relational Database means there is a structure in our data. We have columns (known as fields) and rows (known as records) in our table. MySQL is an example of a Relational Database. An example of NoSQL database is MongoDB. To learn more about NoSQL databases, click here.

Structure of a Database

There are a few components to a database:

  • Table
  • Fields (Columns in a table)
  • Records (Rows in a table)

A relational database consists of one or more tables. Each table represents an entity of the database. For example, if we have a company database, a table can represent the employees of the company or the clients of the company. Each table is a grid, with columns and rows. The columns are known as fields in the table and the rows are known as records in the table. In an employee’s table, a field can be the name of the employee, age of employee and the salary of the employee. A row is created as a record in the database. If there are 20 employees in the company, there will be 20 records in the employee’s table.

Each record will correspond to the different fields in the table. For example, if the table has fields of an employee’s name, age, and salary, it may look like this:

Employees table
name age salary
Lisa 32 80000

As you can see, the values can be a string, numeric number, etc. The values in these fields are allowed to be any of these data types:

  • String
  • Int (Integer)
  • Float
  • Boolean

Let’s start with an example.

Let’s assume that you are a sports director at school X. You have a spreadsheet of the different sports that the school offers, the number of enrollments for each sport. Your task is to put it into a SQL table for easier querying and updating. You are given a sports database.

  • Create Table
create table sport_types (
name varchar(100),
location varchar(100),
primary key (name)
);

In this command, I have created 2 fields in the sport_types table. A name field and a location field. These fields are varchar , which is a variable-length datatype of length 100. The primary key uniquely specifies the row in the table. In this case, it is the name of the sport_type. Primary key must always be UNIQUE.

  • Insert
insert into sport_types values ('Track and Field','Field');
insert into sport_types values ('Soccer','Field');
insert into sport_types values ('Gymnastics','Gymnasium');
insert into sport_types values ('Fencing','Gymnasium');
insert into sport_types values ('Swimming','Pool');
insert into sport_types values ('Pole Vaulting','Field');
insert into sport_types values ('Rugby','Field');
insert into sport_types values ('Diving','Pool');
insert into sport_types values ('Hockey','Gymnasium');
insert into sport_types values ('Frisbee','Field');
insert into sport_types values ('Water Polo','Pool');
insert into sport_types values ('Tennis','Court');
insert into sport_types values ('Basketball','Court');
insert into sport_types values ('Boxing','Gymnasium');
insert into sport_types values ('Volleyball','Court');
insert into sport_types values ('Baseball','Field');

I am inserting the records into the sport_types table. Notice that both fields are string. The first value will correspond to the name, and the second will be the location.

If you like to follow along, I have created another table, Students.

create table students (
id int,
first_name varchar(40),
last_name varchar(50),
gender varchar(2),
age int,
grade int,
sport varchar(100),
primary key (id)
);
insert into students values (1, 'James', 'Logan', 'M', 17, 84, 'Hockey');
insert into students values (2, 'Jack', 'Gilmore', 'M', 16, 92, 'Swimming');
insert into students values (3, 'Megan', 'Vansel', 'F', 16, 75, 'Swimming');
insert into students values (4, 'Shawn', 'Cattermole', 'M', 18, 60, 'Gymnastics');
insert into students values (5, 'Nicole', 'Lee', 'F', 18, 93, 'Volleyball');
insert into students values (6, 'Maria', 'Hill', 'F', 17, 82, 'Basketball');
insert into students values (7, 'Wes', 'Owen', 'M', 18, 73, 'Gymnastics');
insert into students values (8, 'Evan', 'Parker', 'M', 17, 80, 'Water Polo');
insert into students values (9, 'Archie', 'Dale', 'M', 18, 52, 'Tennis');
insert into students values (10, 'Jack', 'Blaine', 'M', 17, 77, 'Soccer');
insert into students values (11, 'Hannah', 'Kidd', 'F', 19, 91, 'Fencing');
insert into students values (12, 'Ivan', 'Rice', 'M', 16, 82, 'Diving');
insert into students values (13, 'Vanessa', 'Tsu', 'F', 18, 71, 'Pole Vaulting');
insert into students values (14, 'Christopher', 'Beck', 'M', 18, 83, 'Track and Field');
insert into students values (15, 'Clay', 'Powers', 'M', 17, 90, 'Frisbee');
insert into students values (16, 'Owen', 'Parker', 'M', 19, 80, 'Water Polo');
insert into students values (17, 'Jill', 'Powell', 'F', 18, 62, 'Fencing');
insert into students values (18, 'Tina', 'Mane', 'F', 19, 77, 'Boxing');
insert into students values (19, 'Michael', 'James', 'M', 19, 95, 'Swimming');
insert into students values (20, 'Jane', 'Bale', 'F', 17, 91, 'Diving');
insert into students values (21, 'Tommy', 'Jones', 'M', 19, 86, 'Track and Field');
insert into students values (22, 'Matthew', 'Frommers', 'M', 16, 57, 'Rugby');
insert into students values (23, 'Jeffery', 'Walsh', 'M', 18, 86, 'Frisbee');
insert into students values (24, 'John', 'Lee', 'M', 19, 64, 'Baseball');
Photo by Serena Repice Lentini on Unsplash
  • SELECT

The select function will select the fields specified. If you will want to select all rows, you can select *

SELECT * FROM sport_types;

This query will return all records of the table with all fields.

SELECT name FROM sport_types;

This query will return all records of the table with the name field. The location will NOT be returned.

SELECT first_name, gender, sport FROM students;

This query will return all records with the selected field, first_name, gender and sport from the table Students .

  • WHERE

When performing a SQL, sometimes we will like to filter the records returned. For example, we want to return all sports from the Pool location.

SELECT * FROM sport_types
WHERE location = 'Pool';

This will return all records with location = ‘Pool’

| name        | location      |
|------------ | ------------- |
| Swimming | Pool |
| Diving | Pool |
| Water Polo | Pool |
  • WHERE IN

The WHERE IN clause will return records in a list or subquery

SELECT * FROM sport_types
WHERE location IN ('Pool', 'Gymnasium');

This query will return all records with location either in ‘Pool’ or ‘Gymnasium’.

Results:

| name        | location      |
|------------ | ------------- |
| Gymnastics | Gymnasium |
| Fencing | Gymnasium |
| Swimming | Pool |
| Diving | Pool |
| Hockey | Gymnasium |
| Water Polo | Pool |
| Boxing | Gymnasium |
  • OR Clause

The OR Clause will return records either with the conditions that are being met.

SELECT * FROM sport_types
WHERE location = 'Pool'
OR location = 'Gymnasium';

This will return all records with either location = ‘Pool’ or location = ‘Gymnasium’. Notice the results is similar to the previous query (IN query).

  • AND Clause

The AND Clause will return results if all the conditions are met.

SELECT * from sport_types
WHERE name = 'Diving'
AND location = 'Pool';

This will return all records that have name = ‘Diving’ and location = ‘Pool’.

Results:

| name        | location      |
|------------ | ------------- |
| Diving | Pool |
  • NOT Clause

The NOT Clause will return records that negate the condition specified.

SELECT * FROM sport_types
WHERE location NOT IN ('Pool', 'Gymnasium);

The query will return records that have locations NOT IN ‘Pool’ OR ‘Gymnasium’.

Results:

| name             | location      |
|---------------- | ------------- |
| Track and Field | Field |
| Soccer | Field |
| Pole Vaulting | Field |
| Rugby | Field |
| Frisbee | Field |
| Tennis | Court |
| Basketball | Court |
| Volleyball | Court |
| Baseball | Field |
  • LIMIT

The limit function will limit the number of records

SELECT * FROM sport_types
WHERE location = 'Pool'
LIMIT 1;

This will return the top 1 result that has a location of ‘Pool’.

| name             | location      |
|---------------- | ------------- |
| Swimming | Pool |
  • ORDER BY

The ORDER BY clause will sort the records based on the conditions.

SELECT * FROM sport_types
WHERE location = 'Pool'
ORDER BY name

This query will sort the records by name in ascending order.

| name             | location      |
|---------------- | ------------- |
| Diving | Pool |
| Swimming | Pool |
| Water Polo | Pool |

You can also sort by descending order. By appending a DESC, the order will be in descending order.

SELECT * FROM sport_types
WHERE location = 'Pool'
ORDER BY name DESC;

Results:

| name             | location      |
|---------------- | ------------- |
| Water Polo | Pool |
| Swimming | Pool |
| Diving | Pool |

ORDER BY can also be done on numeric values. In the same way that was written, we can sort by numeric values.

SELECT * FROM students
WHERE gender = 'M'
ORDER BY age;

Comparison Operators

  • EQUAL(=)

As you have seen from the previous queries, we have used the equal sign to denote if the field is equal to the value. We can use the equal with numeric values too.

  • Greater than or equal to (>=)

The greater than or equal to signifies values greater than or equal to the value we specify.

SELECT * FROM students
WHERE age >= 18;

This query returns all records with students that have the age of 18 or greater.

  • Strictly Greater (>)

This query will select all records that are strictly greater than the threshold value specified.

SELECT * FROM students
WHERE age > 18;

This query returns all records with students that have the age greater than 18.

  • Less than or equal to (<=)

The less than or equal to signifies values less than or equal to the value we specify.

SELECT * FROM students
WHERE age <= 18;

This query returns all records with students that have the age of 18 or less.

  • Strictly Less (<)

This query will select all records that are strictly less than the threshold value specified.

SELECT * FROM students
WHERE age < 18;

This query returns all records with students that have the age less than 18.

Aggregate Functions

  • COUNT()

COUNT() will count the number of records.

SELECT COUNT(*) FROM students;

Results:

| count            |
|---------------- |
| 24 |
  • SUM()

SUM() will sum up all values in the field.

SELECT SUM(grade) FROM students;

This query will sum up the grades of all records in the Student’s table.

| sum              |
|---------------- |
| 1883 |
  • MAX()

MAX() will get the maximum value.

SELECT MAX(grade) FROM students;

This query will select the maximum grade of all the records.

| max              |
|---------------- |
| 95 |
  • MIN()

MIN() will get the minimum value.

SELECT MIN(grade) FROM students;

This query will select the maximum grade of all the records.

| min              |
|---------------- |
| 52 |
  • AVG()

AVG() will get the average value.

SELECT AVG(grade) FROM students;

This query will select the maximum grade of all the records.

| avg                   |
|---------------------- |
| 78.4583333333333333 |

Looking at the values, you can see we have many decimal points. There are functions in SQL that we can perform to round the value to n number of decimal points.

Photo by ThisisEngineering RAEng on Unsplash
  • ROUND
SELECT ROUND(AVG(grade),2) FROM students;

This will round the value to 2 decimal points.

  • TRUNC
SELECT TRUNC(AVG(grade),2) FROM students;

This will truncate the value to 2 decimal points.

  • CEIL
SELECT CEIL(AVG(grade)) FROM students;

This will round the value UP to the nearest integer.

  • FLOOR
SELECT FLOOR(AVG(grade)) FROM students;

This will round the value down to the nearest integer.

  • GROUP BY

The GROUP BY clause will group the records accordingly and you can perform any of these aggregate functions for more detailed querying.

Let’s assume you will like to find out what is the maximum grade, average grade and minimum grade for each age group of students.

SELECT age, MAX(grade), ROUND(AVG(grade), 2) as avg, MIN(grade) FROM students
GROUP BY age
ORDER BY age desc

Result:

| age      | max      | avg     | min     |
|--------- | -------- |-------- |-------- |
| 19 | 95 | 82.17 | 64 |
| 18 | 93 | 72.50 | 52 |
| 17 | 91 | 84.00 | 77 |
| 16 | 92 | 76.50 | 57 |

Hopefully, this will help with simple queries to filter and gather records! Feel free to reach out to me if you have any questions!

--

--