Using Neo4j and Cypher to reproduce São Paulo’s subway system

Lucas Moda
Published in Level Up Coding
7 min read · Apr 3, 2020

Hello guys! Today I’ll be writing a short article explaining how to use a graph database to emulate a subway system, and how to integrate it with Python. The GitHub repo can be accessed here: https://github.com/lukmoda/graph-subway

Introduction

Graphs

Well, first of all, a quick reminder of what graphs are.

A graph is a data structure that consists of entities and connections. Let’s see some concepts:

  • Node: The object, the entity. It can be a person, a city, an animal; anything that is the subject of study (you can think of Nodes as nouns). Nodes are also called Vertices;
  • Properties: Much like attributes in OOP, Nodes can have properties, expressed as key-value pairs. Let’s say you have an Animal node “Lion”. It can have properties like {Size: 1.8m, Weight: 150kg, etc…};
  • Relationship: This expresses how the nodes are connected (you can think of Relationships as verbs). They define the relation between nodes (e.g., “Node A LIKES Node B”, “Node C LIVES WITH Node D”, etc.). Relationships are also called Edges;
Example taken straight from Neo4j’s website.
  • Direction, Cycle and Weight: These are “additional” properties of a graph, used to classify it. The direction tells you which node a relationship starts from and which node it points to. Connections can also have weights (you can think of them as the distance between nodes). Finally, if you can get back to a node by following connections, we say the Graph is Cyclic. You might come across big words like DAG (Directed Acyclic Graph), but they just describe how the graph behaves.
Example of a Weighted DAG: connections between Nodes (blue circles) have a direction, there are no cycles, and edges have weights. As one can see, DAGs are very similar to another data structure: Trees.
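
To make these concepts concrete, here is a minimal Python sketch of a directed, weighted graph together with a cycle check. The node names and weights are made up for illustration; only the Lion properties come from the example above.

```python
# Directed, weighted graph as an adjacency mapping: node -> {neighbor: weight}.
# Node names and weights are illustrative only.
graph = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 2},
    "C": {"B": 1, "D": 5},
    "D": {},
}

# Node properties are plain key-value pairs, as in the Lion example.
properties = {"Lion": {"Size": "1.8m", "Weight": "150kg"}}


def has_cycle(adjacency):
    """Return True if the directed graph contains at least one cycle (DFS)."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # we came back to a node on the current path
        visiting.add(node)
        found = any(visit(nxt) for nxt in adjacency.get(node, {}))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in adjacency)


print(has_cycle(graph))  # False: directed, weighted and acyclic (a DAG)

cyclic = {**graph, "D": {"A": 3}}  # adding D -> A closes a cycle
print(has_cycle(cyclic))  # True
```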

Neo4j and Cypher

Neo4j is a graph database. Instead of traditional tables, entities are modeled as a graph, with nodes and connections. Since it is a NoSQL database, it has its own query language, called Cypher, which is straightforward and easy to understand. We will be using these technologies further ahead, but you can learn more on Neo4j’s website, which is very well documented (https://neo4j.com/developer/graph-database/).

São Paulo’s Subway System

Now we come to the object of our study. Where I work (https://levee.com.br/), we use AI and ML to match candidates with jobs that are a good fit. We have millions of candidates registered on our platform, and studies indicate that people who live closer to their jobs have a higher probability of succeeding in the position. Still, we wanted to go a bit further and check whether the number of stations a person has to travel through is relevant to the model (spoiler: it is). It is not the aim of this article to show our model; I just wanted to explain one possible application.

To find the number of stations traversed, we first needed to model the subway system, which is… very similar to a graph! Then, since we have the lat-long of each candidate and job, we can calculate the closest subway station to each (that is, set the start and end nodes) and then find the shortest path. São Paulo’s Subway system is (as of early 2020) made up of 13 lines (divided into modern, underground trains called “Metrô”, with 6 lines, and older, above-ground trains called “CPTM”, with 7 lines), 183 stations and 371 km of track (https://www.metrocptm.com.br/veja-o-mapa-de-estacoes-do-metro-e-cptm/).

São Paulo’s Subway system. Can you see how much it resembles a graph?
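
The "closest station" step mentioned above can be sketched with the haversine (great-circle) distance. This is one possible approach, not necessarily the one used in our model; the station coordinates below are approximate and only for illustration, and the function names are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat-long points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_station(lat, lon, stations):
    """Return the station dict with the smallest distance to (lat, lon)."""
    return min(stations, key=lambda s: haversine_km(lat, lon, s["lat"], s["lon"]))


# Approximate coordinates, for illustration only.
stations = [
    {"name": "Sé", "lat": -23.5505, "lon": -46.6333},
    {"name": "Luz", "lat": -23.5365, "lon": -46.6335},
]

candidate = (-23.5370, -46.6340)  # hypothetical candidate address
print(closest_station(*candidate, stations)["name"])  # Luz
```

Running this once for the candidate and once for the job gives the start and end nodes for the shortest-path query.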

Building the Graph

Collecting data

OK, this part may not be strictly necessary if you want to build a solution for your own city, but it really helps to keep things organized. What I did was create a table with each station and its lat-long, line by line. It kept me from getting lost:

Header of the file. Do it for every station of your system!

Start Neo4j server

Once you have installed Neo4j on your machine, let’s get the server up and running. It is as simple as this:

sudo neo4j start

After a while the server will be available at localhost:7474. You should see a screen like this:

Neo4j localhost web interface.

Create Nodes

Next, we will be editing a Cypher script (common extensions are .cypher and .cql). Before we start creating the nodes, I will add a uniqueness constraint on the ids:

Here, s works like an alias for every Station object, and the constraint says that the ids have to be unique.
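
The original snippet is not reproduced in this text, so here is a sketch of what a constraint matching that description could look like, built as a Python string so it can later be sent to the server. The helper name is hypothetical. The ON ... ASSERT form was the syntax current in early 2020; newer Neo4j releases use FOR ... REQUIRE instead.

```python
def unique_constraint(label, prop, legacy=True):
    """Return a Cypher uniqueness constraint for `prop` on nodes with `label`.

    legacy=True gives the ON ... ASSERT form used around early 2020;
    legacy=False gives the FOR ... REQUIRE form of newer Neo4j releases.
    """
    if legacy:
        return f"CREATE CONSTRAINT ON (s:{label}) ASSERT s.{prop} IS UNIQUE;"
    return f"CREATE CONSTRAINT FOR (s:{label}) REQUIRE s.{prop} IS UNIQUE;"


print(unique_constraint("Station", "id"))
print(unique_constraint("Station", "id", legacy=False))
```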

Now, let’s create our Station nodes, like an INSERT in regular SQL:

Every Cypher statement in the script ends with a semicolon. Creating a node is simple: we use the keyword CREATE and, inside parentheses, give an identifier for the node and the kind of object it represents (called a Label in Neo4j). The keyword SET is used to create the node’s properties, separated by commas.
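
As a sketch of how the station table could be turned into CREATE ... SET statements, the Python below generates one statement per row. The column names, the property names and the coordinates are assumptions for illustration (the real file header is not shown in this text), and the helper names are hypothetical.

```python
import csv
import io

# Hypothetical table layout; coordinates are approximate, for illustration.
CSV_TEXT = """id,name,line,lat,lon
1,Luz,1-Blue,-23.5365,-46.6335
2,Sé,1-Blue,-23.5505,-46.6333
"""


def cypher_string(value):
    """Quote a Python string as a single-quoted Cypher string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def create_station_statement(row):
    """Build a CREATE ... SET statement for one row of the station table."""
    var = f"s{int(row['id'])}"  # identifier used only inside this statement
    props = [
        f"{var}.id = {int(row['id'])}",
        f"{var}.name = {cypher_string(row['name'])}",
        f"{var}.line = {cypher_string(row['line'])}",
        f"{var}.lat = {float(row['lat'])}",
        f"{var}.lon = {float(row['lon'])}",
    ]
    return f"CREATE ({var}:Station) SET " + ", ".join(props) + ";"


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
for row in rows:
    print(create_station_statement(row))
```

Writing the printed statements to a .cypher file gives a script of the kind described above.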

Create Relationships

Next, we need to create the relationships. For the subway system, I called the relationship “CONNECTS”. This is done in two parts, Match and Merge:

Note that there is only one semicolon, at the end of the last statement. The “Match” phase is like a SQL SELECT, where you indicate which nodes (you can use aliases as well, like in this example) will be referenced to make the edges. Then, the “Merge” phase creates the relationships if they do not already exist. You pass the start node and end node in parentheses and, between them, the connection verb in [:verb]. The “-” and “>” indicate Direction. This graph is directed. Neo4j always stores a relationship with a direction: if you merge without “>”, an existing relationship in either direction is matched, and if none exists, one is created with a direction chosen for you. To treat connections as two-way, you can leave the arrow out of the pattern when querying, or create one relationship in each direction. It’s simple, expressive, and elegant.
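
A sketch of the Match and Merge pattern described above, generated from Python for consecutive stations on a line. Matching stations by an id property and the ids themselves are assumptions for illustration; the helper name is hypothetical.

```python
def connects_statement(start_id, end_id):
    """Build a MATCH ... MERGE statement linking two stations by id."""
    return (
        f"MATCH (a:Station {{id: {int(start_id)}}}), "
        f"(b:Station {{id: {int(end_id)}}})\n"
        "MERGE (a)-[:CONNECTS]->(b);"
    )


# Hypothetical ids for three consecutive stations on the same line.
line_ids = [1, 2, 3]
for start, end in zip(line_ids, line_ids[1:]):
    print(connects_statement(start, end))
```

Because MERGE only creates what is missing, running the generated script twice should not duplicate the CONNECTS relationships.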

And what about finding the shortest path between two nodes? Neo4j already has a handy function for you:

Just pass the start and end nodes and use the shortestPath function. If you want to return the whole path instead of its length, return nodes(p). The “[*..50]” is an upper bound that stops the algorithm from following paths longer than 50 relationships (hops). You can set it to any value or remove it entirely, but it can be very useful if your graph is very dense.
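
Here is a sketch of a query with that shape, again built as a Python string. Looking stations up by a name property, the $start and $end parameter names, and the undirected pattern (so the path can be found regardless of the travel direction) are assumptions, not necessarily what the original script does.

```python
def shortest_path_query(max_hops=50, return_path=False):
    """Build a shortestPath query between two stations given as parameters.

    The query expects $start and $end (station names) to be supplied when
    it is run. length(p) counts the relationships (hops) on the path.
    """
    result = "[n IN nodes(p) | n.name]" if return_path else "length(p)"
    return (
        "MATCH (a:Station {name: $start}), (b:Station {name: $end}), "
        f"p = shortestPath((a)-[:CONNECTS*..{int(max_hops)}]-(b)) "
        f"RETURN {result}"
    )


print(shortest_path_query())
print(shortest_path_query(max_hops=30, return_path=True))
```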

You can do a lot more stuff with Cypher:

And many, many more. For a complete guide on Cypher, check this Cheat Sheet: https://neo4j.com/docs/cypher-refcard/current/

Run the Script

You can either run a Cypher script or open cypher-shell and type the commands directly there. Cypher Shell looks like this:

You can use it just like a regular SQL client shell, passing Cypher statements.

Or you can run the script directly in the terminal:

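Just pass your username and password, and then the Cypher file after “<” (for example, cypher-shell -u your_username -p your_password < subway.cypher, where subway.cypher is a placeholder file name).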

After you create all the nodes and all relationships (that’s the hard part!), you can see your graph in the web UI:

Zoomed in on São Paulo’s city center, with its famous Luz Station highlighted (see its properties in the bottom left corner).

The complete Graph looks like this:

Beautiful, isn’t it?

Integrating with Python

Now that our subway graph is done, let’s see how we can integrate it with Python. Here is a simple script:

Python script to connect with Neo4j and do Cypher queries.

All you need to do is instantiate a Graph object from the py2neo package, passing the server, user and password as arguments. Then you pass Cypher queries as strings and use the evaluate method to retrieve the results. It is that simple!
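
A minimal sketch along those lines, assuming py2neo is installed and the server is reachable over Bolt on the default port. The URI, the credentials, the name property and the helper names are placeholders or assumptions, not the exact script from the repo.

```python
QUERY = (
    "MATCH (a:Station {{name: $start}}), (b:Station {{name: $end}}), "
    "p = shortestPath((a)-[:CONNECTS*..{max_hops}]-(b)) "
    "RETURN {result}"
)


def connect(uri="bolt://localhost:7687", user="neo4j", password="your_password"):
    """Open a connection to a running Neo4j server (placeholder credentials)."""
    from py2neo import Graph  # imported here so the helpers below work without it

    return Graph(uri, auth=(user, password))


def stations_between(graph, start, end, max_hops=50):
    """Return the number of hops on the shortest path between two stations."""
    query = QUERY.format(max_hops=int(max_hops), result="length(p)")
    return graph.evaluate(query, start=start, end=end)


def path_between(graph, start, end, max_hops=50):
    """Return the station names along the shortest path."""
    query = QUERY.format(max_hops=int(max_hops), result="[n IN nodes(p) | n.name]")
    return graph.evaluate(query, start=start, end=end)
```

With a server running and the graph loaded, stations_between(connect(), "Butantã", "Tatuapé") would return the hop count; evaluate returns the first value of the first record, which is why each query returns a single expression.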

This is what we get when we run the script to find the number of stations and the path between Butantã (Line 4-Yellow) and Tatuapé (Line 3-Red):

Output of script.

With the graph, connections and queries in place, we have everything needed to create the feature containing the number of stations between candidate and job!

Final Words

I hope this article helps people who want to use the power of graphs in their analyses. I had only heard about Neo4j before having to build this feature, and I was able to learn it quickly and develop this solution. The database is very well documented, scalable, easy to learn and use, and integrated with many languages (such as Python); Cypher is nothing to be afraid of either. Next time you see a problem that can be modeled as a graph, strongly consider using Neo4j and Cypher to build your solution!
