```
Neo4j version - Community Edition 3.5.2
Language Driver - Python neo4j-driver 1.7.1
```

*neo4j* is a graph database.

The graph is represented by nodes (vertices), relationships (edges), and their properties. Nodes are connected by relationships, and both nodes and relationships can have one or more properties. *neo4j* stores the data as a directed graph.

- Friendly query language (Cypher)
- Full ACID compliance for each transaction
- High scalability
- Flexible graph structure rather than static tables
- High performance compared to relational databases

Other graph databases include Oracle NoSQL Database, OrientDB, HyperGraphDB, GraphBase, InfiniteGraph, and AllegroGraph.

Node: the building block of a graph.

Relationship: connects two nodes. A relationship always has a direction.

Property: a key-value pair. Both nodes and relationships may have properties.

Path: one or more nodes connected by relationships. A path usually represents the result of a specific query.

Queries are implemented as relationship-based graph traversals, and *neo4j* provides a handful of traversal APIs.

- Cypher CREATE: create a node for each data point
- Cypher LOAD CSV: convert the data to a CSV file and import it with the LOAD CSV clause
- Batch Inserter: a Java API
- Batch Importer: a tool by Michael Hunger
- neo4j-import: the official command-line tool

| | CREATE | LOAD CSV | Batch Inserter | Batch Importer | neo4j-import |
|---|---|---|---|---|---|
| When to use? | 1 – 10,000 nodes | 10,000 – 100,000 nodes | > 1,000,000 nodes | > 1,000,000 nodes | > 1,000,000 nodes |
| Speed | very slow (1,000 nodes/s) | slow (5,000 nodes/s) | fast (10,000 nodes/s) | fast (> 10,000 nodes/s) | fast (> 10,000 nodes/s) |
| Advantages | convenient, real-time import | real-time import, loads local/server CSV | fast | based on Batch Inserter, executes a compiled JAR, imports data directly into the database | official release, uses fewer resources than Batch Inserter |
| Disadvantages | extremely slow | slow, data must be converted to CSV | Java only, data must be converted to CSV first, neo4j must be stopped during import | data must be converted to CSV first, not real-time | data must be converted to CSV first, not real-time, can only import into a new database (not an existing one) |

**1. CREATE**
1,000 nodes per transaction:

```
CREATE (:label {property1:value, property2:value, property3:value} )
```

| 115,000 nodes | 185,000 nodes |
|---|---|
| 100 s | 160 s |

**2. LOAD CSV**

```
USING PERIODIC COMMIT 1000 // commit every 1000 rows per transaction
LOAD CSV FROM "file:///fscapture_screencapture_syscall.csv" AS line
CREATE (:label {a:line[1], b:line[2], c:line[3], d:line[4], e:line[5], f:line[6], g:line[7], h:line[8], i:line[9], j:line[10]})
```

| 115,000 nodes | 185,000 nodes |
|---|---|
| 21 s | 39 s |

**3. Neo4j-import**

neo4j-import must be executed on the server, and the resources allocated to it have a direct impact on import speed. Here, I allocated 16 GB to the JVM on my server.

```
sudo ./bin/neo4j-import --into graph.db --nodes:label path_to_csv.csv
```

| 115,000 nodes | 185,000 nodes | 1,500,000 nodes, 15,000,000 edges | 30,000,000 nodes, 78,000,000 edges |
|---|---|---|---|
| 3.4 s | 3.8 s | 26.5 s | 3 min 48 s |

- For a new project, neo4j-import is the best choice for large, fast batch imports.
- For an existing project, if you can afford interruptions to your current database, Batch Importer is your best bet; alternatively, build your own importer on top of the open-source Batch Inserter.
- For an existing project that cannot tolerate any interruption, LOAD CSV is the best choice.
- For simple real-time importing, just use CREATE.

These models are called *feedforward* because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no *feedback* connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called *recurrent neural networks*.

Feedforward neural networks are called *networks* because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f(1), f(2), and f(3) connected in a chain, to form f(x) = f(3)(f(2)(f(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the *first layer* of the network, f(2) is called the *second layer*, and so on. The overall length of the chain gives the *depth* of the model. The name "deep learning" arose from this terminology. The final layer of a feedforward network is called the *output layer*.
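The chain structure can be sketched as plain function composition. The layer functions below are toy placeholders I made up, not trained layers; the point is only the f(3)(f(2)(f(1)(x))) shape:

```python
# A depth-3 "chain" as function composition. Each f_i stands in for
# one layer; the actual functions here are arbitrary toy examples.
def f1(x):          # first layer
    return 2 * x

def f2(x):          # second layer
    return x + 3

def f3(x):          # output layer
    return x ** 2

def f(x):
    return f3(f2(f1(x)))

print(f(1))  # f3(f2(f1(1))) = (2*1 + 3)**2 = 25
```

The depth of this model is 3, the length of the chain.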

During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data do not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, they are called *hidden layers*.

Each hidden layer of the network is typically vector valued. The dimensionality of these hidden layers determines the *width* of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron, in the sense that it receives input from many other units and computes its own activation value.

It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

References: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.

Imagine we have two clusters of points (see figure below) to classify. How do you do it?

- Simple linear regression: a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes an approximately linear relationship between X and Y.
- Multiple linear regression: a better approach is to extend the simple linear regression model so that it can directly accommodate multiple predictors.
- Polynomial regression: in some cases, the true relationship between the response and the predictors is nonlinear.

- it is well understood
- it is inherited from statistics (many ways to evaluate and diagnose)
- it has a closed-form formula for computing the solution
- the result is very easy to interpret

**Assumptions**: The observed response (dependent) variable y is the true function f(x) plus additive Gaussian noise e with a mean of 0.

*In plain English:* the expected value of the response variable should be a linear combination of k independent attributes/features.

**Hypothesis Space**: the set of linear functions (hyperplanes)

Then, the goal is to find the weights of a linear function that minimize the sum of squared residuals.

It turns out we have a closed-form solution for this problem.

If you are interested in the derivation of this closed-form solution, please check it here.
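The closed-form (normal-equation) solution can be sketched in a few lines of NumPy. The tiny dataset below is an invented example with no noise, so the recovered weights should match the generating ones exactly:

```python
import numpy as np

# Synthetic data generated by y = 3 + 2x (no noise).
# The first column of ones carries the intercept term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([3.0, 5.0, 7.0])

# Closed-form least squares: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # approximately [3. 2.]
```

In practice `np.linalg.lstsq` is preferred over forming X^T X explicitly, for numerical stability.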

Regression: learning a function that maps an n-tuple to a continuous value.

Classification: learning a function that maps an n-tuple to a discrete value from a finite set.

To build a linear classifier from regression: 1) label each class with a number 2) call that number the response variable 3) analytically derive a regression line 4) round the regression output to the nearest label number

Potential problems: 1) non-linearity of the response-predictor relationships 2) correlation of error terms 3) non-constant variance of error terms 4) outliers 5) high-leverage points 6) collinearity

The perceptron algorithm assigns the initial weights randomly and traverses all the input points, computing the sum of weighted inputs for each point. If the sum of weighted inputs is larger than the threshold (often 0), the perceptron assigns the output 1 (activated), otherwise 0 (deactivated). If every predicted output matches the desired output, the performance is considered satisfactory and no changes to the weights are made; otherwise the weights need to be changed to reduce the error.

The termination conditions are 1) find the satisfactory weights successfully or 2) the algorithm exceeds the maximum iteration.

```
g(x) = w^T x
```

```
h(x) = 1 (if g(x) > 0)
h(x) = -1 (otherwise)
```

```
w = some random setting
do (for a certain number of iterations)
    k = (k+1) mod m
    if h(x_k) != y_k
        w = w + y_k * x_k
until ∀k: h(x_k) = y_k
```

The termination conditions are: 1) a linearly separating set of weights is found; 2) the iteration limit is exceeded without finding one.
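The pseudocode above can be turned into a runnable sketch. The toy dataset is an invented, linearly separable example, and the weights start at zero (rather than random) for reproducibility:

```python
import numpy as np

def perceptron(X, y, max_iter=1000):
    """Perceptron learning: on each misclassified point, update
    w = w + y_k * x_k. Rows of X already include a bias feature of 1.
    Labels are +1 / -1, matching h(x) in the text."""
    w = np.zeros(X.shape[1])        # zero init for reproducibility
    for _ in range(max_iter):
        errors = 0
        for k in range(len(X)):
            h = 1 if w @ X[k] > 0 else -1
            if h != y[k]:
                w = w + y[k] * X[k] # nudge the boundary toward x_k
                errors += 1
        if errors == 0:             # ∀k: h(x_k) = y_k
            return w
    return w                        # hit the iteration limit

# Toy separable data: label is the sign of the second feature.
X = np.array([[1, 2.0], [1, 1.0], [1, -1.0], [1, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
print(all((1 if w @ x > 0 else -1) == t for x, t in zip(X, y)))  # True
```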

- Get some example set of cases with known outputs
- When seeing a new case, assign its output to be the same as that of the most similar known case. (Then the question becomes: how do you define the most similar known case?)
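This is the 1-nearest-neighbor rule. A minimal sketch, assuming Euclidean distance as the similarity measure and using made-up example cases:

```python
def nearest_neighbor(known, query):
    """known: list of (point, label) pairs.
    Return the label of the known point closest to query."""
    def dist2(p, q):
        # squared Euclidean distance (ordering is the same as Euclidean)
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(known, key=lambda pl: dist2(pl[0], query))
    return label

# Hypothetical known cases with labels.
cases = [((0.0, 0.0), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbor(cases, (1.0, 1.0)))  # "A": closer to (0, 0)
```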

Whether we are playing chess or learning to drive a car, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Reinforcement Learning learns from interaction. The task of it is to learn how to behave to achieve a goal.

Reinforcement learning is 1) much more focused on goal-directed learning, and 2) explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment.

1) **Supervised Learning** is learning from a training set of labeled examples provided by a knowledgable external supervisor. The object of this kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly in situations not present in the training set.

2) **Unsupervised Learning** is typically about finding structure hidden in collections of unlabeled data.

3) **Reinforcement Learning** is trying to maximize a reward signal instead of trying to find hidden structure.

A single decision is made from multiple discrete actions, each with an associated reward. The goal is to maximize the reward. We can pick the action with the largest reward each time (i.e., act *greedy*).
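The greedy choice is just an argmax over the estimated rewards. The reward table below is a hypothetical example:

```python
# Greedy action selection: pick the action with the largest
# estimated reward. The values here are invented for illustration.
rewards = {"left": 1.0, "stay": 0.5, "right": 2.0}

best_action = max(rewards, key=rewards.get)
print(best_action)  # right
```

Pure greedy selection never explores; in practice an epsilon-greedy variant picks a random action a small fraction of the time, which is exactly the exploration-exploitation trade-off discussed below.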

Reinforcement learning frames the problem using ideas from dynamical systems theory, specifically as the optimal control of incompletely-known Markov decision processes. The basic idea is simply to capture the most important aspects of the real problem facing a learning agent interacting over time with its environment to achieve a goal. **Each decision affects subsequent decisions.** 1) A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. 2) The agent also must have a goal or goals relating to the state of the environment.

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future. The agent must try a variety of actions and progressively favor those that appear to be best.

All involve interaction between an active decision-making agent and its environment, within which the agent seeks to achieve a goal despite uncertainty about its environment. The agent's actions are permitted to affect the future state of the environment (e.g., the next chess position, the level of reservoirs of the refinery, the robot's next location and the future charge level of its battery), thereby affecting the actions and opportunities available to the agent at later times. Correct choice requires taking into account indirect, delayed consequences of actions, and thus may require foresight or planning.

the effects of actions cannot be fully predicted; thus the agent must monitor its environment frequently and react appropriately. the agent can use its experience to improve its performance over time.

**Policy**, **Reward**, **Value function**, and **Model of the environment**.

Policy defines the learning agent’s way of behaving at a given time. It is a mapping from perceived states of the environment to actions to be taken when in those states.

Reward defines the goal of a reinforcement learning problem. The environment returns the agent a value as a reward at each step. The goal of the learning is simply to maximize the rewards.

The value function is different from the reward signal: it defines what is good in the long run when following the policy from a state, rather than what is good in an immediate step.

A model of the environment mimics the behavior of the environment. It allows inferences to be made about how the environment will behave.

On-policy learners evaluate the policy that is used to make decisions, so they are influenced by the exploration policy. Off-policy methods evaluate a policy different from the one used to generate the data, so they do not depend on the exploration policy.

For example: On-policy learner: SARSA;

Off-policy learner: Q-learning.
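To make the off-policy idea concrete, here is a sketch of the Q-learning update rule on a made-up two-state table. The update bootstraps from the *best* next action, regardless of which action the behavior policy actually takes next (SARSA would instead use the Q-value of the action actually taken):

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy TD update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    best_next = max(Q[s_next].values())     # max over next actions
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q

# Hypothetical tiny MDP: two states, two actions.
Q = {"s0": {"a": 0.0, "b": 0.0},
     "s1": {"a": 0.0, "b": 1.0}}
q_learning_update(Q, "s0", "a", r=1.0, s_next="s1")
print(Q["s0"]["a"])  # 0.1 * (1.0 + 0.9 * 1.0 - 0.0) ≈ 0.19
```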

Objective: find a shorter tree that is as consistent with the training examples as possible.

ID3 algorithm: (recursively) choose “most significant” attribute as root of a (sub)tree.

```
function DTL(examples, attributes, default) returns a decision tree
    if examples is empty
        return default
    else if all examples have the same classification
        return the classification
    else
        best-attribute = CHOOSE-ATTRIBUTE(attributes, examples)
        tree = a new decision tree with root best-attribute
        for each value v_i of best-attribute do
            examples_i = {elements of examples with best-attribute = v_i}
            subtree = DTL(examples_i, attributes - best-attribute, MODE(examples))
            add a branch to tree with label v_i and subtree subtree
        return tree
```
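In ID3, CHOOSE-ATTRIBUTE means the attribute with the highest information gain, which is computed from entropy. A minimal sketch of the entropy computation (the label lists are invented examples):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits:
    H = -sum(p_c * log2(p_c)) over classes c."""
    n = len(labels)
    probs = (count / n for count in Counter(labels).values())
    return sum(-p * log2(p) for p in probs)

print(entropy(["yes", "no"]))          # 1.0  (maximally mixed split)
print(entropy(["yes", "yes", "yes"]))  # 0.0  (pure split)
```

Information gain of an attribute is then the entropy of the parent set minus the weighted average entropy of the subsets induced by the attribute's values.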

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one.

It is a special kind of contingency table, with two dimensions (“actual” and “predicted”), and identical sets of “classes” in both dimensions (each combination of dimension and class is a variable in the contingency table).

For binary classification, it has two rows and two columns, reporting the number of false positives, false negatives, true positives, and true negatives.

Accuracy is a description of systematic errors, a measure of statistical bias; as these cause a difference between a result and a “true” value, ISO calls this trueness.

precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances.

recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances.

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision p and the recall r of the test to compute the score: p is the number of correct positive results divided by the number of all positive results returned by the classifier, and r is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive). The F1 score is the harmonic average of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.
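The definitions of precision, recall, and F1 can be computed directly from confusion-matrix counts. The counts below are an invented example:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # correct positives / all predicted positives
    recall = tp / (tp + fn)      # correct positives / all actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)  # all three ≈ 0.8
```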

A tree is a collection of nodes *connected* by directed (or undirected) edges.

*Acyclic*

A tree is a *nonlinear data structure*, compared to arrays, linked lists, stacks and queues which are linear data structures.

A tree can be empty (no nodes), or it is a structure consisting of one node, called the root, and zero or more subtrees.

If a given tree has *n* nodes, it has *n-1* edges

```
public class TreeNode {
    public int key;
    public TreeNode left;
    public TreeNode right;

    public TreeNode(int key) {
        this.key = key;
    }
}
```

**1. Height:** the height of a node is the number of edges from that node to the deepest leaf.

```
public int height(TreeNode root) {
    if (root == null) {
        return 0;
    }
    int left = height(root.left);
    int right = height(root.right);
    return Math.max(left, right) + 1;
}
// Time complexity: O(n) -- every node is visited once
// Space complexity: O(height) for the recursion stack
```

**2. Depth:** the depth of a node is the number of edges from the root to that node.

```
// Depth of a given node.
// Assumption: the target node is in the tree.
public int depth(TreeNode root, TreeNode target) {
    return depth(root, target, 0); // the root itself has depth 0
}
private int depth(TreeNode root, TreeNode target, int level) {
    if (root == null) {
        return -1; // not found in this subtree
    }
    if (root == target) {
        return level;
    }
    // case 1: found in the left subtree
    int result = depth(root.left, target, level + 1);
    if (result != -1) {
        return result;
    }
    // case 2: found in the right subtree (or -1 if not found at all)
    return depth(root.right, target, level + 1);
}
```

**3. Balanced tree:** for each node, the heights difference of its left subtree and right subtree is no larger than 1.

```
// Solution 1: top-down (use the definition directly: check the height
// difference at this node, then check the left and right subtrees)
// Time complexity: O(n log n) when balanced, O(n^2) in the worst case
// Space complexity: O(height)
public boolean isBalanced(TreeNode root) {
    if (root == null) {
        return true;
    }
    int left = getHeight(root.left);
    int right = getHeight(root.right);
    if (Math.abs(left - right) > 1) {
        return false;
    }
    return isBalanced(root.left) && isBalanced(root.right);
}
private int getHeight(TreeNode root) {
    if (root == null) {
        return 0;
    }
    return Math.max(getHeight(root.left), getHeight(root.right)) + 1;
}

// Solution 2: bottom-up (check balance while computing the height;
// return the height if balanced, otherwise -1)
// Time complexity: O(n)
// Space complexity: O(height)
public boolean isBalanced(TreeNode root) {
    return checkHeight(root) != -1;
}
private int checkHeight(TreeNode root) {
    // Returns the height of this node if its whole subtree is balanced,
    // otherwise returns -1.
    if (root == null) {
        return 0;
    }
    int left = checkHeight(root.left);
    int right = checkHeight(root.right);
    if (left == -1 || right == -1 || Math.abs(left - right) > 1) {
        return -1;
    }
    return Math.max(left, right) + 1;
}
```

**4. Complete tree:** every level except possibly the last is completely filled, and the nodes on the last level are as far left as possible. Equivalently: in a level-order traversal, once we meet a null child, all remaining children must be null.

```
// A tree is complete iff, in level order, once we meet a null child,
// all of the remaining children are also null.
public boolean isComplete(TreeNode root) {
    if (root == null) {
        return true;
    }
    Queue<TreeNode> queue = new LinkedList<TreeNode>();
    boolean metNull = false;
    queue.offer(root);
    while (!queue.isEmpty()) {
        TreeNode curr = queue.poll();
        // left child: after the first null, any non-null child
        // makes the tree incomplete
        if (curr.left != null) {
            if (metNull) {
                return false;
            }
            queue.offer(curr.left);
        } else {
            metNull = true;
        }
        // right child: same check
        if (curr.right != null) {
            if (metNull) {
                return false;
            }
            queue.offer(curr.right);
        } else {
            metNull = true;
        }
    }
    return true;
}
```

**5. Full tree:** for each node, it has either no children or two children.

```
// Time complexity: O(n)
// Space complexity: O(height)
public boolean isFull(TreeNode root) {
    if (root == null) {
        return true;
    }
    if (root.left == null && root.right == null) {
        return true;
    }
    if (root.left != null && root.right != null) {
        return isFull(root.left) && isFull(root.right);
    }
    return false; // exactly one child: not a full tree
}
```

- PreOrder

```
// C++ code
// Data structure: stack
public:
    vector<int> preorderTraversal(TreeNode* root) {
        vector<int> result;
        if (root == NULL) return result;
        stack<TreeNode*> stk;
        stk.push(root);
        while (!stk.empty()) {
            root = stk.top();
            stk.pop();
            result.push_back(root->val);
            // push right first so the left subtree is processed first
            if (root->right != NULL) {
                stk.push(root->right);
            }
            if (root->left != NULL) {
                stk.push(root->left);
            }
        }
        return result;
    }
```

- InOrder

```
// C++ code
// Data structure: stack
public:
    vector<int> inorderTraversal(TreeNode* root) {
        vector<int> result;
        stack<TreeNode*> stk;
        // push the chain of left children starting from the root
        TreeNode* currNode = root;
        while (currNode != NULL) {
            stk.push(currNode);
            currNode = currNode->left;
        }
        while (!stk.empty()) {
            currNode = stk.top();
            stk.pop();
            result.push_back(currNode->val);
            // then push the left chain of the right subtree
            currNode = currNode->right;
            while (currNode != NULL) {
                stk.push(currNode);
                currNode = currNode->left;
            }
        }
        return result;
    }
```

- PostOrder

```
// Java code
// Data structure: stack
public List<Integer> postOrder(TreeNode root) {
    List<Integer> res = new ArrayList<>();
    // sanity check
    if (root == null) {
        return res;
    }
    Deque<TreeNode> stack = new ArrayDeque<>();
    stack.offerLast(root);
    TreeNode prev = null;
    while (!stack.isEmpty()) {
        TreeNode curr = stack.peekLast();
        if (prev == null || curr == prev.left || curr == prev.right) {
            // case 1: curr is the root, or we are going down
            if (curr.left != null) {
                stack.offerLast(curr.left);
            } else if (curr.right != null) {
                stack.offerLast(curr.right);
            } else {
                stack.pollLast();
                res.add(curr.val);
            }
        } else if (prev == curr.left) {
            // case 2: going up from the left subtree
            if (curr.right != null) {
                stack.offerLast(curr.right);
            }
        } else {
            // case 3: going up from the right subtree, or no right subtree
            stack.pollLast();
            res.add(curr.val);
        }
        prev = curr;
    }
    return res;
}
```

- LevelOrder

```
// Java code
// Data structure: queue
public List<List<Integer>> levelOrder(TreeNode root) {
    List<List<Integer>> result = new ArrayList<>();
    if (root == null) {
        return result;
    }
    Queue<TreeNode> queue = new LinkedList<TreeNode>();
    queue.offer(root);
    while (!queue.isEmpty()) {
        List<Integer> currLevel = new ArrayList<>();
        int size = queue.size();
        for (int i = 0; i < size; i++) {
            TreeNode curr = queue.poll();
            currLevel.add(curr.val); // collect values of the current level
            if (curr.left != null) {
                queue.offer(curr.left);
            }
            if (curr.right != null) {
                queue.offer(curr.right);
            }
        }
        result.add(currLevel);
    }
    return result;
}
```

- For each node, nodes in the left subtree are smaller than the current root, and nodes in the right subtree are larger than the current root.
- BST Inorder traversal: ascending order.

- Determine if a binary tree is binary search tree?

```
public boolean isBST(TreeNode root) {
    return isBST(root, Integer.MIN_VALUE, Integer.MAX_VALUE);
}
private boolean isBST(TreeNode root, int leftBound, int rightBound) {
    if (root == null) {
        return true;
    }
    if (root.val <= leftBound || root.val >= rightBound) {
        return false;
    }
    return isBST(root.left, leftBound, root.val)
        && isBST(root.right, root.val, rightBound);
}
```

- In a BST, find the node whose value is closest to the target.

```
// Recursive solution
// Time complexity: O(height)
// Space complexity: O(height)
public TreeNode closestTreeNode(TreeNode root, int target) {
    if (root == null) {
        return null;
    }
    TreeNode next = root.key < target ? root.right : root.left;
    if (next == null) { return root; }
    TreeNode curr = closestTreeNode(next, target);
    return Math.abs(curr.key - target) < Math.abs(root.key - target) ? curr : root;
}

// Iterative solution
// Time complexity: O(height)
// Space complexity: O(1)
public TreeNode closestTreeNode(TreeNode root, int target) {
    if (root == null) {
        return null;
    }
    TreeNode result = root;
    while (root != null) {
        if (root.key == target) {
            return root;
        }
        result = Math.abs(root.key - target) < Math.abs(result.key - target) ? root : result;
        root = root.key < target ? root.right : root.left;
    }
    return result;
}
```

- In a BST, find the largest number smaller than the target.

```
// Iterative solution
// Time complexity: O(height)
// Space complexity: O(1)
public TreeNode largestSmaller(TreeNode root, int target) {
    TreeNode result = null;
    while (root != null) {
        if (root.key < target) { // candidate found; try to find a larger one on the right
            result = root;
            root = root.right;
        } else {
            root = root.left;
        }
    }
    return result; // null if no key is smaller than the target
}
```

- Two Sum on BST.

```
Solution 1: populate a sorted array with an inorder traversal, then run 2Sum (two pointers) on it.
- Time complexity: O(n)
- Space complexity: O(n)
Solution 2: use two BST iterators (one forward, one backward) to run 2Sum directly on the tree.
- Time complexity: O(n)
- Space complexity: O(height)
```

Breadth-First Search Topological Sorting

]]>