Importing bulk data into neo4j - an efficiency analysis
Written:
Neo4j version - Community Edition 3.5.2
Language Driver - Python neo4j-driver 1.7.1
What is neo4j?
neo4j is a Graph Database.
The graph is represented by nodes (vertex), their relationships (edge) and their properties (property). Nodes are connected by edges (their relationships). Both vertex(nodes) and edges(relationship) could have one or multiple properties. They are stored in neo4j as a directional graph.
Some advantages of neo4j
- Friendly query language (Cypher)
- Full ACID properties of each transaction
- High scalibility
- Flexible network structure rather than static tables
- High performance compared to relational database
Some other graph database includes Oracle NoSQL, OrientDB, HypherGraphDB,GraphBase,InfiniteGraph,AllegroGraph.
Some concepts of Graph Database
Node: the building block of a graph.
Relationship: connect two nodes. Relationship always contains a direction.
Property: key-value pair. Both node and relationship may have properties.
Path: a path contains one or more nodes connecting by relationships. It usually represents the result of one specific query.
The query is implemented by relationship-based graph traversal. neo4j provides a handful of traversal APIs.
Different means to import nodes
- Cypher CREATE: create a node for each data point
- Cypher LOAD CSV: convert data to csv file, and import data with LOAD CSV API
- Batch Inserter: a Java API
- Batch Importer by Michael Hunger
- Neo4j-import
Analysis of different importing methods
CREATE | LOAD CSV | Batch Inserter | Batch Importer | Neo4j-import | |
---|---|---|---|---|---|
When to use? | 1 - 10,000 nodes | 10,000 - 100,000 nodes | > 1,000,000 nodes | > 1,000,000 nodes | > 1,000,000 nodes |
Speed | very slow (1000 nodes/s) | slow ( 5000 nodes/s ) | fast (10,000 nodes/s) | fast ( > 10,000 nodes/s) | fast ( > 10,000 nodes/s) |
Advantages | convenient, real time import | real-time import, pre-load local/server CSV | fast | Based on Batch Inserter, exexcute complied JAR, direct import data from database | official release, cost less resources than Batch Inserter |
Disadvantages | extremely slow | slow, need to convert to CSV | use only in JAVA, need to convert to CSV first, have to stop neo4j while importing | convert to CSV first, not real-time | convert to CSV first, not real-time, can only import into new database (NO for existing database) |
Speed test
1. CREATE Every 1000 nodes per transaction
CREATE (:label {property1:value, property2:value, property3:value} )
115000 nodes | 185000 nodes |
---|---|
100s | 160s |
2. LOAD CSV
using periodic commit 1000 #Every 1000 nodes per transaction
load csv from "file:///fscapture_screencapture_syscall.csv" as line
create (:label {a:line[1], b:line[2], c:line[3], d:line[4], e:line[5], f:line[6], g:line[7], h:line[8], i:line[9], j:line[10]})
115000 nodes | 185000 nodes |
---|---|
21s | 39s |
3. Neo4j-import
Neo4j-import needs to be executed on server. Allocation of server resources has direct impact on the speed of importing. Here, I allocated 16Gb for JVM on my server.
sudo ./bin/neo4j-import --into graph.db --nodes:label path_to_csv.csv
115000 nodes | 185000 nodes | 1,500,000 nodes 15,000,000 edges | 30,000,000 nodes 78,000,000 edges |
---|---|---|---|
3.4s | 3.8s | 26.5s | 3 min 48s |
Conclusion
- For new project, Neo4j-import is the best choice for large, fast batch importing.
- For existed project, if you can afford interruptions on your current database, Batch Import is your best bet. Or you can implement by yourself based on open source Batch inserter.
- For existed project, without any interruptions on current database, LOAD CSV is the best choice.
- For simple real-time importing, just use CREATE.