{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# heterograph: queries\n", "\n", "Heterograph provides functionality to identify specific patterns, or subgraphs, within larger graphs using DFS (Depth-First Search). This is achieved through the use of a `qgraph` (query graph). The `qgraph` encodes a subgraph pattern, which is then used as a query to locate instances of this pattern within a larger graph. For example, in a graph that represents a C++ program, a qgraph could be used to identify all the loops. To construct these qgraphs, Heterograph utilizes a query language known as 'AQL'. We use this facility in Artisan, but it is kept generic for other uses.\n", "\n", "## Query Graphs (qgraph)\n", "\n", "### AQL grammar\n", "\n", "AQL has the following syntax:\n", " \n", "- **identifiers**: Start with a letter, followed by letters, numbers, or underscores.\n", " - Example: `person`, `node1`\n", "\n", "- **arguments**: \n", " - Single or comma-separated values, e.g.: `\"Alice\"`, `30, \"Engineer\"`\n", " - Key-Value Pairs: Use `key: value`, e.g.: `name: \"Alice\"`, `age: 30`\n", "\n", "- **nodes**: Identifier alone or with arguments.\n", " - Example: `person`, `person { name: \"Alice\", age: 30 }`\n", "\n", "- **edges**: Simple (`=>`) or with arguments (`={ args }>`).\n", " - Example: `=>`, `={ weight: 5 }>`, `={ 10, 12 }>`\n", "\n", "- **graphs**: Group of graphs, single node, or connected graphs.\n", " - Example: \n", " - `0` # empty graph\n", " - `node1 => node2` # edge between node1 and node2\n", " - `(node1 | node2) => node3` # node1 and node2 are both connected to node3\n", "\n", "### bulding qgraphs" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "\n", "A:\n", "0 \n", "\n", "\n", "\n", "1\n", "\n", "\n", "\n", "B:\n", "1 \n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from heterograph.query.qgraph import QGraph\n", "\n", "QGraph(pattern='A => B').view()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "\n", "A:\n", "0 \n", "\n", "\n", "\n", "2\n", "\n", "\n", "\n", "C:\n", "2 \n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "1\n", "\n", "\n", "\n", "B:\n", "1 \n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "QGraph(pattern='(A | B) => C').view()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "\n", "X:\n", "0 \n", "\n", "\n", "\n", "1\n", "\n", "\n", "\n", "A:\n", "1 \n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "\n", "\n", "\n", "B:\n", "2 \n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "\n", "\n", "\n", "C:\n", "3 \n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "\n", "\n", "\n", "4\n", "\n", "\n", "\n", "D:\n", "4 \n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "2->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "\n", "\n", "\n", "E:\n", "5 \n", "\n", "\n", "\n", "3->5\n", "\n", "\n", "\n", "\n", "\n", "4->5\n", "\n", "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "QGraph(pattern='X=>(A|B)=>(C|D)=>E').view()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "\n", "\n", "a:\n", "0 ('name': 'A')\n", "\n", "\n", "\n", "1\n", "\n", "\n", "\n", "b:\n", "1 ('name': 'B')\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "('d': True)\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "QGraph(pattern='a{name:\"A\"} ={d:True}>b{name:\"B\"}').view()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Query Processor DFS\n", "\n", "The `QueryProcessorDFS` class is utilized to identify specific subgraph patterns within a larger, arbitrary graph (`HGraph`). After creating an instance of this class, the `run` method can be invoked with the appropriate parameters to execute the search:\n", "- `g`: This is the graph to be queried. It should be an instance of HGraph.\n", "- `qgraph`: This is the query graph, an instance of QGraph, which represents the pattern you're looking for in the main graph.\n", "- `vs`: This is a list of vertices from which the query should start (roots)\n", "- `path_check`: This is a function that checks the validity of a path. It should take a path as input and return a boolean indicating whether the path is valid.\n", "- `match_filter`: This is a function that filters the matches. It should take a match as input and return a boolean indicating whether the match should be included in the results.\n", "- `fd`: This is an optional parameter that specifies the distance between the root (specified in `vs`) and first node of a match. It can be an integer or a tuple of two integers. If it is an integer, the distance must be exactly that integer. If it is a tuple, the distance must be within the range specified by the tuple.\n", "- `gd`: This is an optional parameter that specifies the distance between the root (specified in `vs`) and any node of a match. It can be an integer or a tuple of two integers. If it is an integer, the distance must be exactly that integer. If it is a tuple, the distance must be within the range specified by the tuple.\n", "\n", "As an example, consider the following graph where each vertex has a label and a type (circle or rect)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "0\n", "\n", "A:0\n", "\n", "\n", "\n", "1\n", "\n", "B:1\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "\n", "C:2\n", "\n", "\n", "\n", "0->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "\n", "D:3\n", "\n", "\n", "\n", "0->3\n", "\n", "\n", "\n", "\n", "\n", "4\n", "\n", "E:4\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "2->4\n", "\n", "\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "\n", "F:5\n", "\n", "\n", "\n", "4->5\n", "\n", "\n", "\n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from heterograph import *\n", "\n", "g = HGraph()\n", "for i in range(0, 6):\n", " vx = g.add_vx()\n", " g.pmap[vx]['name'] = chr(i+ord('A'))\n", " g.pmap[vx]['type'] = \"circle\" if i % 2 == 0 else \"rect\"\n", "\n", " g.vstyle['label'] = lambda g, vx: f\"{g.pmap[vx]['name']}:{vx}\"\n", " g.vstyle['shape'] = lambda g, vx: g.pmap[vx]['type']\n", "\n", "g.add_edge(0, [1, 2, 3])\n", "g.add_edge([1,2,3], 4)\n", "g.add_edge(4, 5)\n", "g.view()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our example, we develop a query system that matches specific constraints, namely label and type. The initial step in this process involves transferring AQL arguments into the vertices of the query graph. For this, we create the `vx_args` function which is be passed to `QGraph` constructor when we create a query." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# we will pass this function when we create a QGraph\n", "def vx_args(qg, vx, *, name=None, type=None):\n", " qg.pmap[vx]['name'] = name\n", " qg.pmap[vx]['type'] = type" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we set up the query processor. This involves defining two optional functions: `path_check` and `match_filter`. The `path_check` function is used to verify if any two vertices from the query graph correspond to two vertices in the main graph. If `True` then then become part of the match. The `match_filter` function is used to filter out the complete matches based on certain criteria. \n", "\n", "In our example, we use `path_check` to check if vertex constraints, such as name or type match." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from heterograph.query.processor_dfs import QueryProcessorDFS\n", "\n", "def path_check(g, qg, gchain, qgchain):\n", " # This allows for a comparison between two vertices that form a path (q_vpair) and\n", " # two vertices from the query graph (g_vpair). For example, it can be used to\n", " # implement distance constraints between two vertices. In this case, we check the\n", " # vertex contraints match\n", "\n", " vx = gchain[1]\n", " qvx = qgchain[1]\n", "\n", " for attr in ['name', 'type']:\n", " if qg.pmap[qvx][attr] is not None:\n", " # match the exact attribute value, we could\n", " # use regex or other matching\n", " if g.pmap[vx][attr] != qg.pmap[qvx][attr]:\n", " return False\n", "\n", " return True\n", "\n", "processor = QueryProcessorDFS()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us create a query: find all `circle` vertices." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[{'x': 4}, {'x': 4}, {'x': 2}, {'x': 4}, {'x': 0}]\n" ] } ], "source": [ "qg=QGraph('x{type:\"circle\"}', vx_args=vx_args)\n", "\n", "matches=processor.run(g, qg, path_check=path_check)\n", "print(matches)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each match corresponds to a mapping from a vertex in the query graph (qgraph) to a vertex in the main graph. It is important to note, as shown above, that duplicates may occur. For this purpose, the `QueryResultSe`t class is used to refine these results and optionally perform actions on them. One such action is the removal of identical results, which can be achieved using the `RSet::distinct` method." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " x\n", "---\n", " \u001b[90m4\u001b[0m\n", " \u001b[90m2\u001b[0m\n", " \u001b[90m0\u001b[0m\n" ] } ], "source": [ "from heterograph.query.resultset import QueryResultSet\n", "from heterograph.query.rsutils import RSet\n", "\n", "qrs = QueryResultSet(g, qg, matches)\n", "qrs = RSet.distinct(qrs)\n", "print(qrs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let us do a more complex query and find a circle vertex connected to a vertex of any type." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " x y\n", "--- ---\n", " \u001b[90m4\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m2\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m2\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m1\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m2\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m3\u001b[0m" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "qg=QGraph('x{type:\"circle\"}=>y', vx_args=vx_args)\n", "\n", "matches=processor.run(g, qg, path_check=path_check)\n", "RSet.distinct(QueryResultSet(g, qg, matches))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please note that each entry in the list of matches is independent. Furthermore, although the distance between vertices `x` and `y` is not explicitly constrained, it is guaranteed that `y` is part of the path when `x` is visited. To specify distance constraints, you can add a constraint to the edge in the query graph (qgraph). The `eg_args` parameter can then be used to translate AQL edge arguments into the qgraph edge property map, similar to how `vx_args` is used for vertices.\n", "\n", "The `QueryProcessorDFS` class includes two optional global parameters, `fd` and `gd`, which filters out matches.\n", "\n", "The `fd` parameter is a tuple (min, max) that defines the minimum and maximum distance between the root vertex, where the search begins, and the first element found in the match. If a single integer `d` is provided, it is interpreted as both the minimum and maximum distance.\n", "\n", "For the above example, the search starts with root vertex 0 (\"A\"). So if we specify fd=1, then we have:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " x y\n", "--- ---\n", " \u001b[90m4\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m2\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m2\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m1\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m2\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m3\u001b[0m\n" ] }, { "data": { "text/plain": [ " x y\n", "--- ---\n", " \u001b[90m2\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m2\u001b[0m \u001b[90m4\u001b[0m" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# without \"fd\" constraint\n", "matches=processor.run(g, qg, path_check=path_check)\n", "print(RSet.distinct(QueryResultSet(g, qg, matches)))\n", "\n", "# with fd=1. x=2 is exactly 1 edge away from the root vertex 0\n", "matches=processor.run(g, qg, path_check=path_check, fd=1)\n", "RSet.distinct(QueryResultSet(g, qg, matches))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "On the other hand, the `gd` parameter follows the same distance type as `fd`, that is (int,int) or int. It provides the distance between root vertex 0 (\"A\") where the search starts and any node in a match. In this example, we specify the `gd` distance between (0, 2). " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " x y\n", "--- ---\n", " \u001b[90m4\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m2\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m2\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m5\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m1\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m2\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m3\u001b[0m\n" ] }, { "data": { "text/plain": [ " x y\n", "--- ---\n", " \u001b[90m2\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m4\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m1\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m2\u001b[0m\n", " \u001b[90m0\u001b[0m \u001b[90m3\u001b[0m" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# without \"gd\" constraint\n", "matches=processor.run(g, qg, path_check=path_check)\n", "print(RSet.distinct(QueryResultSet(g, qg, matches)))\n", "\n", "# with gd=(0, 2). Note that (4, 5) and (2, 5), for instance, do not meet the distance constraint\n", "# since vertex 5 has a distance of 3 from the root vertex 0.\n", "matches=processor.run(g, qg, path_check=path_check, gd=(0, 2))\n", "RSet.distinct(QueryResultSet(g, qg, matches))" ] } ], "metadata": { "kernelspec": { "display_name": "metaml", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }