Importing bulk data into neo4j - an efficiency analysis


Written: June 15, 2019

Neo4j version - Community Edition 3.5.2
Language Driver - Python neo4j-driver 1.7.1

What is neo4j?

neo4j is a graph database.

A graph consists of nodes (vertices), the relationships (edges) that connect them, and properties. Both nodes and relationships can have one or more properties. neo4j stores the data as a directed graph.

Some advantages of neo4j

Friendly query language (Cypher)
Full ACID properties of each transaction
High scalability
Flexible network structure rather than static tables
High performance on relationship-heavy queries compared to relational databases

Other graph databases include Oracle NoSQL, OrientDB, HyperGraphDB, GraphBase, InfiniteGraph, and AllegroGraph.

Some concepts of Graph Database

Node: the building block of a graph.

Relationship: connects two nodes. A relationship always has a direction.

Property: a key-value pair. Both nodes and relationships may have properties.

Path: one or more nodes connected by relationships. It usually represents the result of a specific query.

Queries are executed by relationship-based graph traversal. neo4j provides a handful of traversal APIs.
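To make the four concepts concrete, here is a minimal Python sketch that models them as plain data classes. This is only an illustration of the vocabulary, not how neo4j stores data internally; the class names, the `is_connected` helper, and the `KNOWS` example are all made up for this post.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """A node: labels plus key-value properties."""
    labels: List[str]
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed connection from `start` to `end` with its own properties."""
    start: Node
    end: Node
    type: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Path:
    """An alternating walk: nodes[0], relationships[0], nodes[1], ..."""
    nodes: List[Node]
    relationships: List[Relationship]

    def is_connected(self) -> bool:
        """Check that each relationship links the two neighbouring nodes.

        A relationship always has a direction, but a path may follow it
        either way, so both orientations are accepted here.
        """
        if len(self.nodes) != len(self.relationships) + 1:
            return False
        for i, rel in enumerate(self.relationships):
            a, b = self.nodes[i], self.nodes[i + 1]
            forward = rel.start is a and rel.end is b
            backward = rel.start is b and rel.end is a
            if not (forward or backward):
                return False
        return True


alice = Node(["Person"], {"name": "Alice"})
bob = Node(["Person"], {"name": "Bob"})
knows = Relationship(alice, bob, "KNOWS", {"since": 2019})
path = Path([alice, bob], [knows])
```
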

Different means to import nodes

Cypher CREATE: create a node for each data point
Cypher LOAD CSV: convert the data to a CSV file and import it with the LOAD CSV clause
Batch Inserter: a Java API
Batch Importer by Michael Hunger
Neo4j-import

Analysis of different importing methods

	CREATE	LOAD CSV	Batch Inserter	Batch Importer	Neo4j-import
When to use?	1 - 10,000 nodes	10,000 - 100,000 nodes	> 1,000,000 nodes	> 1,000,000 nodes	> 1,000,000 nodes
Speed	very slow (1,000 nodes/s)	slow (5,000 nodes/s)	fast (10,000 nodes/s)	fast (> 10,000 nodes/s)	fast (> 10,000 nodes/s)
Advantages	convenient, real-time import	real-time import, loads local or server-side CSV files	fast	based on Batch Inserter, runs as a compiled JAR, imports data directly from a database	official release, uses fewer resources than Batch Inserter
Disadvantages	extremely slow	slow, data must be converted to CSV	Java only, data must be converted to CSV first, neo4j must be stopped while importing	data must be converted to CSV first, not real-time	data must be converted to CSV first, not real-time, can only import into a new database (not an existing one)

Speed test

1. CREATE (1000 nodes per transaction)

CREATE (:label {property1:value, property2:value, property3:value} )

115000 nodes	185000 nodes
100s	160s
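A rough sketch of how the "1000 nodes per transaction" batching could look from Python. It is not the exact script used for the timings above. It assumes `session` is a session object from the Python neo4j-driver (anything exposing `begin_transaction()`, whose transaction has `run()`, `commit()` and `rollback()`); the `chunked` and `create_nodes` helpers and the property-map parameter name `props` are my own.

```python
from typing import Any, Dict, Iterable, Iterator, List

# One CREATE per row; the whole property map is passed as a parameter.
CREATE_QUERY = "CREATE (:label $props)"


def chunked(rows: Iterable[Dict[str, Any]], size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive lists of at most `size` rows."""
    batch: List[Dict[str, Any]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def create_nodes(session, rows: Iterable[Dict[str, Any]], batch_size: int = 1000) -> int:
    """Create one node per row, committing once per batch.

    Returns the number of committed transactions.
    """
    committed = 0
    for batch in chunked(rows, batch_size):
        tx = session.begin_transaction()
        try:
            for props in batch:
                tx.run(CREATE_QUERY, props=props)
        except Exception:
            # Undo the partial batch and let the caller see the error.
            tx.rollback()
            raise
        else:
            tx.commit()
            committed += 1
    return committed
```

With a real database, `session` would come from something like `GraphDatabase.driver(uri, auth=(user, password)).session()`, where `uri`, `user` and `password` are placeholders for your own setup.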

2. LOAD CSV

USING PERIODIC COMMIT 1000 // commit every 1000 rows
LOAD CSV FROM "file:///fscapture_screencapture_syscall.csv" AS line
CREATE (:label {a:line[1], b:line[2], c:line[3], d:line[4], e:line[5], f:line[6], g:line[7], h:line[8], i:line[9], j:line[10]})

115000 nodes	185000 nodes
21s	39s
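The "convert to CSV" step can be sketched in Python as follows. This is an illustration under a few assumptions, not the script behind the numbers above: the helper names `write_csv` and `build_load_csv_query` are made up, the CSV has no header row, and columns are mapped starting at `line[0]`, because Cypher lists are zero-indexed (the query above starts at `line[1]`, which skips the first column of the file).

```python
import csv
from typing import Iterable, Sequence


def write_csv(path: str, rows: Iterable[Sequence[object]]) -> int:
    """Write rows to a headerless CSV file and return the number of rows written."""
    count = 0
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def build_load_csv_query(file_url: str, label: str, keys: Sequence[str],
                         batch_size: int = 1000) -> str:
    """Build a LOAD CSV statement that maps column i (zero-based) to keys[i]."""
    props = ", ".join(f"{key}: line[{i}]" for i, key in enumerate(keys))
    return (
        f"USING PERIODIC COMMIT {batch_size}\n"
        f'LOAD CSV FROM "{file_url}" AS line\n'
        f"CREATE (:{label} {{{props}}})"
    )
```

The generated statement has to be sent as an auto-commit query (for example with `session.run(...)`), because `USING PERIODIC COMMIT` cannot be executed inside an explicit transaction. Note also that LOAD CSV reads every field as a string unless the query converts it.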

3. Neo4j-import

Neo4j-import must be run on the server. The resources allocated to it directly affect import speed. Here, I allocated 16 GB to the JVM on my server.

sudo ./bin/neo4j-import --into graph.db --nodes:label path_to_csv.csv

115000 nodes	185000 nodes	1,500,000 nodes 15,000,000 edges	30,000,000 nodes 78,000,000 edges
3.4s	3.8s	26.5s	3 min 48s
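To compare the three tests on one scale, the timings can be converted to nodes per second. The short Python sketch below only does that arithmetic on the measurements reported in this post; the `nodes_per_second` helper is my own, and for the two largest Neo4j-import runs the rate counts nodes only, so it understates the work done because edges were imported in the same run.

```python
# (method, nodes imported, elapsed seconds) taken from the timings in this post.
MEASUREMENTS = [
    ("CREATE", 115_000, 100.0),
    ("CREATE", 185_000, 160.0),
    ("LOAD CSV", 115_000, 21.0),
    ("LOAD CSV", 185_000, 39.0),
    ("Neo4j-import", 115_000, 3.4),
    ("Neo4j-import", 185_000, 3.8),
    ("Neo4j-import", 1_500_000, 26.5),     # plus 15,000,000 edges
    ("Neo4j-import", 30_000_000, 228.0),   # 3 min 48 s, plus 78,000,000 edges
]


def nodes_per_second(nodes: int, seconds: float) -> float:
    """Average import throughput for one run."""
    return nodes / seconds


for method, nodes, seconds in MEASUREMENTS:
    rate = nodes_per_second(nodes, seconds)
    print(f"{method:<13}{nodes:>12,} nodes  {rate:>10,.0f} nodes/s")
```

CREATE works out to about 1,150 nodes/s and LOAD CSV to roughly 4,700 to 5,500 nodes/s, which matches the estimates in the comparison table. Neo4j-import gets faster per node as the data set grows, which suggests a fixed start-up cost dominates the small runs.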

Conclusion

For a new project, Neo4j-import is the best choice for large, fast batch imports.
For an existing project, if you can afford to interrupt your current database, Batch Importer is your best bet. Alternatively, you can build your own tool on top of the open-source Batch Inserter.
For an existing project that cannot tolerate any interruption, LOAD CSV is the best choice.
For simple real-time imports, just use CREATE.
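The recommendations above boil down to a small decision rule. Here is one way to write it down in Python; the function name and its three flags are hypothetical, and the rule simply restates the conclusions of this post rather than any official guidance.

```python
def recommend_import_method(bulk: bool, new_database: bool,
                            can_stop_database: bool = False) -> str:
    """Pick an import method following the conclusions of this post.

    bulk: True for a large batch import, False for simple real-time inserts.
    new_database: True if the data goes into a fresh, empty database.
    can_stop_database: True if the existing database may be interrupted.
    """
    if not bulk:
        return "CREATE"
    if new_database:
        return "Neo4j-import"
    if can_stop_database:
        return "Batch Importer"
    return "LOAD CSV"
```

For example, `recommend_import_method(bulk=True, new_database=False)` returns `"LOAD CSV"`, the option for an existing database that has to stay online.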

Jiuyuan Wang
