Importing bulk data into neo4j - an efficiency analysis

2 minute read

Written:

Neo4j version - Community Edition 3.5.2
Language Driver - Python neo4j-driver 1.7.1

What is neo4j?

neo4j is a Graph Database.

The graph is represented by nodes (vertex), their relationships (edge) and their properties (property). Nodes are connected by edges (their relationships). Both vertex(nodes) and edges(relationship) could have one or multiple properties. They are stored in neo4j as a directional graph.

Some advantages of neo4j

  • Friendly query language (Cypher)
  • Full ACID properties of each transaction
  • High scalibility
  • Flexible network structure rather than static tables
  • High performance compared to relational database

Some other graph database includes Oracle NoSQL, OrientDB, HypherGraphDB,GraphBase,InfiniteGraph,AllegroGraph.

Some concepts of Graph Database

Node: the building block of a graph.

Relationship: connect two nodes. Relationship always contains a direction.

Property: key-value pair. Both node and relationship may have properties.

Path: a path contains one or more nodes connecting by relationships. It usually represents the result of one specific query.

The query is implemented by relationship-based graph traversal. neo4j provides a handful of traversal APIs.

Different means to import nodes

  1. Cypher CREATE: create a node for each data point
  2. Cypher LOAD CSV: convert data to csv file, and import data with LOAD CSV API
  3. Batch Inserter: a Java API
  4. Batch Importer by Michael Hunger
  5. Neo4j-import

Analysis of different importing methods

 CREATELOAD CSVBatch InserterBatch ImporterNeo4j-import
When to use?1 - 10,000 nodes10,000 - 100,000 nodes> 1,000,000 nodes> 1,000,000 nodes> 1,000,000 nodes
Speedvery slow (1000 nodes/s)slow ( 5000 nodes/s )fast (10,000 nodes/s)fast ( > 10,000 nodes/s)fast ( > 10,000 nodes/s)
Advantagesconvenient, real time importreal-time import, pre-load local/server CSVfastBased on Batch Inserter, exexcute complied JAR, direct import data from databaseofficial release, cost less resources than Batch Inserter
Disadvantagesextremely slowslow, need to convert to CSVuse only in JAVA, need to convert to CSV first, have to stop neo4j while importingconvert to CSV first, not real-timeconvert to CSV first, not real-time, can only import into new database (NO for existing database)

Speed test

1. CREATE Every 1000 nodes per transaction

CREATE (:label {property1:value, property2:value, property3:value} )
115000 nodes185000 nodes
100s160s

2. LOAD CSV

using periodic commit 1000 #Every 1000 nodes per transaction
load csv from "file:///fscapture_screencapture_syscall.csv" as line
create (:label {a:line[1], b:line[2], c:line[3], d:line[4], e:line[5], f:line[6], g:line[7], h:line[8], i:line[9], j:line[10]})
115000 nodes185000 nodes
21s39s

3. Neo4j-import

Neo4j-import needs to be executed on server. Allocation of server resources has direct impact on the speed of importing. Here, I allocated 16Gb for JVM on my server.

sudo ./bin/neo4j-import --into graph.db --nodes:label path_to_csv.csv
115000 nodes185000 nodes1,500,000 nodes 15,000,000 edges30,000,000 nodes 78,000,000 edges
3.4s3.8s26.5s3 min 48s

Conclusion

  1. For new project, Neo4j-import is the best choice for large, fast batch importing.
  2. For existed project, if you can afford interruptions on your current database, Batch Import is your best bet. Or you can implement by yourself based on open source Batch inserter.
  3. For existed project, without any interruptions on current database, LOAD CSV is the best choice.
  4. For simple real-time importing, just use CREATE.