Graph Database Technologies: Neo4j and Cypher Query Language

I. Neo4j

Neo4j is an open-source graph database implemented in Java. Its data is stored as a graph rather than in tables, which allows more efficient visualization and analysis of graph-based relationships. Neo4j was first created in Sweden in 2007 and is now widely used by many websites (see http://neo4j.com/customers/ for a list of customers).

(picture from: http://www.slideshare.net/emileifrem/an-intro-to-neo4j-and-some-use-cases-jfokus-2011)

A graph model is useful in many cases:

Model a social network with nodes and relationships, and properties on both.
Calculate the shortest path between two nodes. For example, if websites are nodes and the links between them are edges, the calculation tells us how many links we need to follow to get from one website to another.
Find nodes with similar activity patterns. For example, an online dating website can use this to find people with the same in-degree and out-degree for certain activities.
Match complicated relationships. For example, find all YouTube users with a specific registration date who were active on a certain video channel that has more than 10M subscribers.
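
To make the shortest-path use case concrete, here is a minimal Python sketch (plain Python, not Neo4j code) that counts the fewest links between two websites with a breadth-first search. The site names and the links dictionary are made up for illustration.

```python
from collections import deque

def shortest_link_count(links, start, goal):
    """Return the fewest links to follow from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        site, hops = queue.popleft()
        for neighbor in links.get(site, ()):
            if neighbor == goal:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None

# Hypothetical link graph: each site maps to the sites it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["d.example"],
    "c.example": ["d.example"],
    "d.example": ["e.example"],
}

hops = shortest_link_count(links, "a.example", "e.example")
```

Breadth-first search visits sites in order of distance, so the first time it reaches the goal it has found a shortest route.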

(picture from Neo4j 8/14/2014 webinar)

For more use cases, please visit http://neo4j.com/use-cases/

Advantages of Neo4j:

In a traditional SQL database, the queries above could take a lot of code and a lot of memory to compute. With Neo4j and the Cypher language, they can be expressed in far less code and executed efficiently.

If you know a little about computer systems, you’ll also see that Neo4j’s storage design is system-friendly and optimized.

In Neo4j, nodes and relationships are stored in separate files. Each node is stored as a constant-length record with a pointer to its first property and its first relationship. Each relationship is stored as a constant-length record with pointers to the previous and next relationships. Because every record has the same length, a record's position in the file can be computed directly from its ID, which allows fast look-up. (from Stanford CS145 summer class)
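
Here is a small Python sketch of why constant-length records make look-up fast: the byte offset of a record is just its ID times the record size. The two-field record layout below is an assumption made up for illustration, not Neo4j's actual on-disk format.

```python
import io
import struct

# Hypothetical node record: (first_relationship_id, first_property_id),
# each a 4-byte little-endian signed integer; -1 means "none".
NODE_RECORD = struct.Struct("<ii")

def write_nodes(store, nodes):
    """Append one fixed-size record per node, in ID order."""
    for first_rel, first_prop in nodes:
        store.write(NODE_RECORD.pack(first_rel, first_prop))

def read_node(store, node_id):
    """Jump straight to the record: offset = node_id * record size."""
    store.seek(node_id * NODE_RECORD.size)
    return NODE_RECORD.unpack(store.read(NODE_RECORD.size))

store = io.BytesIO()  # stands in for a node store file
write_nodes(store, [(0, 10), (-1, 11), (2, -1)])
node_one = read_node(store, 1)
```

No index or scan is needed to find node 1; a single seek lands on its record.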

(screenshot: Neo4j server web interface)

II. Cypher

So what is Cypher?

Cypher is a declarative graph query language for Neo4j. It’s a SQL-like language, but allows for more expressive and efficient querying and updating of the graph store.

The syntax is very intuitive:

 MATCH (charlie:Person { name:'Charlie Sheen' })-[:ACTED_IN]-(movie:Movie)
   RETURN movie

The query above finds all the movies that the Person named “Charlie Sheen” acted in.

The example below finds every pair of distinct nodes connected by a path of at most three relationships, together with the length of the shortest such path:

MATCH p = shortestPath((a)-[*..3]-(b))
WHERE a <> b
RETURN a.name, b.name, length(p)

Does this remind you of LinkedIn?
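
The three-hop limit in the query above can also be sketched outside the database. The Python below is only an illustration of the idea (a depth-limited breadth-first search from every node), not how Neo4j evaluates shortestPath; the people and connections are invented.

```python
from collections import deque

def pairs_within(graph, max_hops):
    """Return (a, b, hops) for every ordered pair a != b at most max_hops apart."""
    results = []
    for start in graph:
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] == max_hops:
                continue  # do not expand past the hop limit
            for neighbor in graph[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        results.extend((start, other, d) for other, d in dist.items() if other != start)
    return results

# Hypothetical undirected network stored as adjacency lists (a simple chain).
graph = {
    "Ann": ["Bob"],
    "Bob": ["Ann", "Cid"],
    "Cid": ["Bob", "Dee"],
    "Dee": ["Cid", "Eve"],
    "Eve": ["Dee"],
}

pairs = pairs_within(graph, 3)
```

Ann and Dee are three hops apart and are reported; Ann and Eve are four hops apart and are left out, much like a third-degree connection limit.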

For more powerful Cypher features, refer to the reference card: bit.ly/cypher-refcard

There is also an advanced track where people do more powerful and creative things. The picture below is from a Graph Database meetup in Chicago, where the presenter showed how to connect to one of the social networks (Facebook, Twitter, LinkedIn) and import profiles and relationships into your graph.

Looking forward to more exciting developments in graph databases.

rCharts example

rCharts is a wonderful tool for data analysis and visualization. A month ago I tried all the examples in the rCharts gallery. Here is an example I built:

Adult obesity rate map with hover showing state names.

I can change the color scale and the data input, and link data from SQL server to R using .csv files. Check out more examples on the gallery website: http://rcharts.io/gallery/

D3 Visualization Example

Last week in my Data Science internship, I created some visualizations with D3. I used two open-source libraries from GitHub, “Circular Heat Chart” and “stackpercent”. They are wonderful for presenting multi-dimensional data!

The chart above represents three time blocks in a day:

Midnight – 4am, 4am – 8am, 8am – noon

The labels are very easy to update, so there is plenty of flexibility to monitor data for any time period. Different colors represent different statuses, and you can define whatever you like.

Below are two sample charts I made:

Documentation:

1. Purpose: Display a single day’s six shifts’ coverage quota.

2. Tools: D3, JS, CSS

3. Variable: four colors representing “Committed”, “One-time Covered”, “One-time Cancel”, “Never Filled”.

The color settings are stored as variables in “day.js”

4. Input requirement: pass in a variable comprised of elements corresponding to color variable names. For example:

var shifData = [
  cover, cover, cover, cover, cover, cover, // The innermost ring, clockwise
  cover, cover, one_time, cover, cover, cover, // The second ring from the center, clockwise
  one_time, cover, one_time, cover, cover, cover,
  cancel, one_time, cancel, cover, cover, one_time,
  cancel, one_time, never, one_time, one_time, cancel,
  never, cancel, never, cancel, never, never // The outermost ring, clockwise
];
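
The flat array is read ring by ring, six segments at a time. As a rough illustration of that mapping (written in Python only to keep it easy to run; the chart itself is JavaScript, and the helper names here are my own), a flat index can be turned into a (ring, segment) position like this:

```python
SEGMENTS_PER_RING = 6  # six shifts per ring, as in the example above

def position(index, segments=SEGMENTS_PER_RING):
    """Map a flat index to (ring, segment); ring 0 is the innermost ring."""
    return divmod(index, segments)

def to_rings(flat, segments=SEGMENTS_PER_RING):
    """Split the flat list into rings, innermost first."""
    if len(flat) % segments != 0:
        raise ValueError("length must be a multiple of the ring size")
    return [flat[i:i + segments] for i in range(0, len(flat), segments)]

# Plain strings stand in for the color variables defined in day.js.
cover, one_time, cancel, never = "cover", "one_time", "cancel", "never"
shift_data = [
    cover, cover, cover, cover, cover, cover,
    cover, cover, one_time, cover, cover, cover,
    one_time, cover, one_time, cover, cover, cover,
    cancel, one_time, cancel, cover, cover, one_time,
    cancel, one_time, never, one_time, one_time, cancel,
    never, cancel, never, cancel, never, never,
]
rings = to_rings(shift_data)
```

For instance, flat index 8 lands on ring 1 (the second ring), segment 2, which holds a one-time cover.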

Documentation:

1. Purpose: Display a single day’s six shifts’ coverage quota.

2. Tools: JS, CSS, open source (https://github.com/skeleton9/flot.stackpercent)

3. Variable:

label: representing a shift coverage state

data-width: the amount/quota of this particular state

data-height: the order of shift, i.e. 1 for 1st shift, 2 for 2nd shift

color: visualization for this shift coverage state

4. Input requirement: pass in a variable comprised of four objects, each with “label”, “data”, and “color”. For example:

var data = [
  {"label": "Committed", "data": [[12,1], [10,2], [7,3], [7,4], [7,5], [7,6]], "color": "green"}, // [12,1] here stands for 12 Committed doctors for the 1st shift
  {"label": "Committed(leave)", "data": [[11,1], [9,2], [8,3], [7,4], [7,5], [7,6]], "color": "#83BA4F"},
  {"label": "Cancel(one-time)", "data": [[10,1], [6,2], [5,3], [7,4], [7,5], [7,6]], "color": "#E78800"},
  {"label": "No Fill Ever", "data": [[10,1], [6,2], [5,3], [7,4], [7,5], [7,6]], "color": "red"}
];
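
A stacked-percent chart shows each state's share of the total for a shift rather than the raw count. The Python sketch below illustrates that normalization with values mirroring the example above; it is my own illustration of the idea, not code taken from the flot.stackpercent plugin.

```python
def percent_by_shift(series):
    """Return {label: {shift: percent}} where each shift's percents sum to 100."""
    totals = {}
    for entry in series:
        for amount, shift in entry["data"]:
            totals[shift] = totals.get(shift, 0) + amount
    return {
        entry["label"]: {
            shift: 100.0 * amount / totals[shift] for amount, shift in entry["data"]
        }
        for entry in series
    }

# Each pair is [amount, shift], following the input format described above.
series = [
    {"label": "Committed", "data": [[12, 1], [10, 2], [7, 3], [7, 4], [7, 5], [7, 6]]},
    {"label": "Committed(leave)", "data": [[11, 1], [9, 2], [8, 3], [7, 4], [7, 5], [7, 6]]},
    {"label": "Cancel(one-time)", "data": [[10, 1], [6, 2], [5, 3], [7, 4], [7, 5], [7, 6]]},
    {"label": "No Fill Ever", "data": [[10, 1], [6, 2], [5, 3], [7, 4], [7, 5], [7, 6]]},
]
shares = percent_by_shift(series)
```

In the fourth shift all four states have 7 doctors each, so every state takes a quarter of the bar.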

Python Web Application – AuctionBase (adding bids like eBay!)

Over the past three weeks I built a web application using Python (web.py and jinja2) and SQLite. It’s a website where you can add bids to open items, browse items of interest, and search item details (start date, end date, current price, seller, number of bids, description, etc.).

There are three milestones for this project:

1. Database preparation

Design the database table schemas so that the database takes as little space as possible while staying well organized for managing, updating, deleting, and searching;
Write a Python script to convert .JSON files to data files (.dat);
Load the data into the SQLite database and write queries to check the performance of the load;
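
The conversion and load steps could look roughly like the sketch below. The column names (ItemID, Name, Currently), the pipe delimiter, and the three-column table are assumptions for illustration; the real project's schema and file layout may differ.

```python
import csv
import io
import json
import sqlite3

# Hypothetical column names; the real auction data may use different fields.
COLUMNS = ["ItemID", "Name", "Currently"]

def json_to_dat(json_text, columns=COLUMNS):
    """Convert a JSON array of item objects to pipe-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    for item in json.loads(json_text):
        writer.writerow([item.get(col, "") for col in columns])
    return out.getvalue()

def load_dat(conn, dat_text):
    """Bulk-insert the .dat rows and return the resulting row count."""
    rows = csv.reader(io.StringIO(dat_text), delimiter="|")
    with conn:  # commit once after all rows are inserted
        conn.executemany("INSERT INTO Items VALUES (?, ?, ?)", rows)
    return conn.execute("SELECT COUNT(*) FROM Items").fetchone()[0]

sample_json = (
    '[{"ItemID": 1, "Name": "Lamp", "Currently": 7.5},'
    ' {"ItemID": 2, "Name": "Desk | oak", "Currently": 40}]'
)
dat_text = json_to_dat(sample_json)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Items (ItemID INTEGER, Name TEXT, Currently REAL)")
loaded = load_dat(conn, dat_text)
```

Using the csv module on both sides means a field that happens to contain the delimiter is quoted on write and restored on read, so the count query after loading is a quick sanity check that nothing was split or dropped.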

2. Setting integrity constraints

Write SQL constraints and triggers to ensure the integrity of the data in the database;
Test the triggers and load them into the database;
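
As an example of the kind of trigger this milestone involves, here is a sketch using Python's built-in sqlite3 module. The table layout, trigger names, and the rule itself (a new bid must exceed the current price) are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: one trigger rejects low bids, another keeps the price current.
SCHEMA = """
CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, Currently REAL NOT NULL);
CREATE TABLE Bids (
    ItemID INTEGER NOT NULL,
    UserID TEXT NOT NULL,
    Amount REAL NOT NULL CHECK (Amount > 0)
);
CREATE TRIGGER bid_must_raise_price
BEFORE INSERT ON Bids
FOR EACH ROW
WHEN NEW.Amount <= (SELECT Currently FROM Items WHERE ItemID = NEW.ItemID)
BEGIN
    SELECT RAISE(ABORT, 'bid must exceed current price');
END;
CREATE TRIGGER bid_updates_price
AFTER INSERT ON Bids
FOR EACH ROW
BEGIN
    UPDATE Items SET Currently = NEW.Amount WHERE ItemID = NEW.ItemID;
END;
"""

def place_bid(conn, item_id, user_id, amount):
    """Try to insert a bid; return True if the database accepted it."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO Bids VALUES (?, ?, ?)", (item_id, user_id, amount))
        return True
    except sqlite3.DatabaseError:
        return False

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
with conn:
    conn.execute("INSERT INTO Items VALUES (1, 10.0)")

first = place_bid(conn, 1, "alice", 12.0)   # above the current price of 10.0
second = place_bid(conn, 1, "bob", 11.0)    # below the new price of 12.0
```

Putting the rule in the database means every code path that inserts a bid is checked, not only the web form.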

3. Web application and interface

Write web.py scripts and jinja2 templates to set up the user interface: input values, select the current time, search by item ID, browse items of interest based on criteria, and add bids;
Use transactions to ensure robust behavior;
Error management: when an error occurs during a search or while adding a bid, it is handled cleanly, an error message is shown, and the user can keep interacting with the website;
Test the constraints and triggers to ensure normal bidding behavior.
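
The transaction and error-handling points above can be sketched with Python's built-in sqlite3 module (used here in place of web.py's database layer so the example runs on its own). The schema and the add_bid helper are hypothetical.

```python
import sqlite3

def add_bid(conn, item_id, user_id, amount):
    """Add a bid atomically; return (ok, message) instead of raising."""
    try:
        with conn:  # one transaction: commit on success, roll back on any error
            row = conn.execute(
                "SELECT Currently FROM Items WHERE ItemID = ?", (item_id,)
            ).fetchone()
            if row is None:
                raise ValueError("unknown item")
            if amount <= row[0]:
                raise ValueError("bid must exceed current price")
            conn.execute(
                "UPDATE Items SET Currently = ? WHERE ItemID = ?", (amount, item_id)
            )
            conn.execute(
                "INSERT INTO Bids (ItemID, UserID, Amount) VALUES (?, ?, ?)",
                (item_id, user_id, amount),
            )
    except (ValueError, sqlite3.DatabaseError) as exc:
        return False, str(exc)
    return True, "bid added"

conn = sqlite3.connect(":memory:")
with conn:  # commit the setup so later rollbacks cannot undo it
    conn.execute("CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, Currently REAL NOT NULL)")
    conn.execute("CREATE TABLE Bids (ItemID INTEGER NOT NULL, UserID TEXT NOT NULL, Amount REAL NOT NULL)")
    conn.execute("INSERT INTO Items VALUES (1, 10.0)")

ok_result = add_bid(conn, 1, "alice", 12.0)
# A missing user violates NOT NULL on the INSERT, after the UPDATE already ran.
bad_result = add_bid(conn, 1, None, 20.0)
low_result = add_bid(conn, 1, "bob", 11.0)
```

In the second call the price update has already happened when the insert fails, and the rollback restores the price to 12.0. Returning a message instead of raising lets the page show the error and keep working.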

Below are some screenshots to show the project progress:

–Before organizing the data, a mess for human eyes–