I had previously shared a sneak preview of our upcoming session on Graph data processing in SQL Server. The talk is at the PASS Summit 2017. In that post, I had promised to share more details closer to the session. And here it is!
Inferring Graph Edges using SQL ML Services
In many cases, the edges in a graph are deterministic and ‘known’ to the application. In other cases, edges have to be ‘inferred’ or ‘discovered’ by some code:
- In some cases, node attributes can be used to detect similar nodes and create an edge
- In other cases, an ETL process could use fuzzy lookups etc.
- For more complex situations, ML Services in SQL Server 2017 and Azure SQL DB can be used as well: sp_execute_external_script can invoke an R / Python script and return a list of keys from which to build edges
In this walkthrough we will use ML Services in SQL Server 2017 to invoke a Python script to infer similar edges in a graph.
Approach
The nodes in this graph will be the sessions at PASS Summit 2017 (with the data imported as per this previous post). We will then use Python to run some natural language processing code that computes a measure of similarity between pairs of sessions, based on their Title and Abstract fields. In summary, here is what we will do:
- Our database has a regular table with all the sessions from PASS Summit 2017
- These sessions are then saved into a Node table in SQL Graph
- Each Session node has attributes like Session Id, Title, Abstract, Speaker Names and Track
- Hypothesis: similar themed sessions have similar keywords in their Title / Abstract
- Using NLP libraries in Python we can break down these sessions into underlying keywords and their frequency counts
- Construct a “similarity matrix” and then return, for each session, those sessions which have more than 20% similarity
- Construct edges in SQL Graph for these related session pairs
Pre-requisites
We will be leveraging two powerful Python libraries, NLTK and Gensim, to help us analyze the text and derive a measure of similarity for pairs of sessions. While NLTK comes pre-installed with SQL Server 2017 ML Services, you have to install Gensim and the stop_words package using pip:
pip install stop_words
pip install gensim
We will then need to install a "corpus" of stop words for NLTK. This will help eliminate some common "noise" words from text to help improve the accuracy of the analysis. To do this we first create a folder for NLTK data:
md "C:\Program Files\Microsoft SQL Server\MSSQL14.SQL2017\PYTHON_SERVICES\Lib\nltk_data"
Then we use nltk.download() to download and install the stopwords corpus as shown below. The important thing to note is to correctly escape the backslash characters in the path when providing it to the NLTK download GUI. In my case I used:
C:\Program Files\Microsoft SQL Server\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\nltk_data
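If you prefer to skip the GUI altogether, nltk.download() also accepts the corpus name and a target folder programmatically. A minimal sketch (the instance path below is just an example; adjust it to your installation):

```python
import nltk

# Example path -- adjust MSSQL14.<instance> to match your SQL Server install.
nltk_data_path = (r"C:\Program Files\Microsoft SQL Server"
                  r"\MSSQL14.MSSQLSERVER\PYTHON_SERVICES\lib\nltk_data")

# nltk.download returns True on success, False on failure; guard it anyway
# so a transient network issue is reported rather than raised.
ok = False
try:
    ok = nltk.download("stopwords", download_dir=nltk_data_path)
except Exception as e:
    print("download failed:", e)
print("stopwords downloaded:", ok)
```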
Here's a screenshot of the above step in case you are wondering:
Once the stopwords corpus is downloaded, we proceed to create the necessary SQL node table, and then convert the previously prepared "regular" table into a Node table using INSERT...SELECT:
CREATE TABLE [dbo].[Session](
    [Index]        INTEGER IDENTITY(1,1) NOT NULL,
    [SessionID]    INTEGER NULL,
    [Abstract]     [nvarchar](max) NULL,
    [SessionLevel] [int] NULL,
    [Speaker1]     [nvarchar](100) NULL,
    [Speaker2]     [nvarchar](100) NULL,
    [Speaker3]     [nvarchar](100) NULL,
    [Title]        [nvarchar](4000) NULL,
    [Track]        [nvarchar](50) NULL
) AS NODE
GO

INSERT INTO Session (SessionID, Abstract, SessionLevel, Speaker1, Speaker2, Speaker3, Title, Track)
SELECT SessionID, Abstract, SessionLevel, Speaker1, Speaker2, Speaker3, Title, Track
FROM dbo.PASS2017Sessions;
GO
We then proceed to create an empty edge table:
CREATE TABLE SimilarSessions (
    SimilarityMetric float
) AS EDGE
This table is implicitly going to hold the "from" and "to" nodes in the graph and additionally it holds a similarity measure value for that relationship.
Using Python (NLTK and Gensim) to compute session similarity
Now that we have the tables in place, let's dig in and do the heavy lifting of text processing and analytics. Given below is the entire code which does the processing, but since it is complex, let me give you a high-level flow before actually presenting the code. Here is what is happening in the code below:
- The session data (titles, session ID, abstract, track and an incremental index number) are provided to the Python script from a T-SQL query (that query is at the very end of this code block)
- Then NLTK is used to break down the title and abstract into words (a process called tokenization)
- We then stem and remove stop words from the tokenized words
- We then proceed to build a corpus of these words, taking only those words which have occurred at least 3 times
- Then we proceed to use TF-IDF to prepare a document matrix of these words and their frequencies in various documents
- Then, Gensim is used to compute "Matrix Similarity" which is basically a matrix of documents and how similar they are to each other.
- Once the similarity matrix is built up, we then proceed to build the output result set which maps back the SessionId values and their similarity measures
- In the above step, one interesting thing to note is that in SQL, graph edges are directed, while our similarity relationship is logically undirected. So before adding an edge from Session1 to Session2, we check that the reverse edge (Session2 to Session1) has not already been added.
- Once this list of unique edges is built up, it is written back into SQL as edges in the SimilarSessions graph (edge) table by using a function called rxDataStep.
A small but important nuance here with rxDataStep and specifically SQL Graph edge tables, is that you need to exactly match the $from_id and $to_id column names with the actual values (including the GUID portions) that are in the edge table. Alternatively, you can avoid using rxDataStep and insert the output of the sp_execute_external_script into a temporary table / table variable and then JOIN back to the node tables to finally insert into the graph edge table. We will look at improving this experience going forward.
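For the alternative route, the end of the script would simply assign the result to OutputDataSet (the ML Services convention for returning a result set) instead of calling rx_data_step; a minimal sketch with illustrative values:

```python
import pandas as pd

# Illustrative rows only -- in the real script this list is built by the
# similarity computation.
similaritylist = [
    {"OriginalSession": 69503, "SimilarSession": 69510, "SimilarityMetric": 21.7},
]

# Assigning to OutputDataSet returns these rows to the caller of
# sp_execute_external_script, where T-SQL can INSERT them into a temp table
# and JOIN back to the node tables to populate the edge table.
OutputDataSet = pd.DataFrame(similaritylist)
print(OutputDataSet)
```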
Take your time to understand the code! Here we go:
exec sp_execute_external_script
    @language = N'Python',
    @script = N'
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.snowball import SnowballStemmer
from gensim import corpora, models, similarities
import gensim
import pandas as pd
from revoscalepy import RxSqlServerData, rx_data_step

# read the data passed in from the T-SQL query
pdDocuments = InputDataSet

tokenizer = RegexpTokenizer(r"\w+")
en_stop = get_stop_words("en")
stemmer = SnowballStemmer("english", ignore_stopwords=True)

def iter_documents(pdSeries):
    """Iterate over all documents, yielding a document (= list of utf8 tokens) at a time."""
    for (idx, docrow) in pdSeries.iterrows():
        concatsessionattributes = list()
        concatsessionattributes.append(docrow.Title.lower())
        concatsessionattributes.append(docrow.Abstract.lower())
        concatsessionattributesstr = " ".join(concatsessionattributes)
        tokens = tokenizer.tokenize(concatsessionattributesstr)
        # remove stop words from tokens, then stem
        stopped_tokens = [i for i in tokens if not i in en_stop]
        final = [stemmer.stem(word) for word in stopped_tokens]
        yield final

class MyCorpus(object):
    def __init__(self, pdSeriesInput):
        self.series = pdSeriesInput
        self.dictionary = gensim.corpora.Dictionary(iter_documents(self.series))
        self.dictionary.filter_extremes(no_below=3)
        self.dictionary.compactify()

    def __iter__(self):
        for tokens in iter_documents(self.series):
            yield self.dictionary.doc2bow(tokens)

corp1 = MyCorpus(pdDocuments)
tfidf = models.TfidfModel(corp1, id2word=corp1.dictionary, normalize=True)
train_corpus_tfidf = tfidf[corp1]
corpora.MmCorpus.serialize("train_ssd_corpus-tfidf.mm", train_corpus_tfidf)
train_corpus_tfidf = corpora.MmCorpus("train_ssd_corpus-tfidf.mm")

index = similarities.MatrixSimilarity(train_corpus_tfidf)
tfidf_sims = index[train_corpus_tfidf]

similaritylist = []

def similarsessions(inputindex):
    print("Selected session: " + pdDocuments.loc[inputindex].Title)
    print()
    print("Most similar sessions are listed below:")
    print()
    topNmatches = tfidf_sims[inputindex].argsort()[-10:][::-1]
    for matchedsessindex in topNmatches:
        if (inputindex != matchedsessindex
                and round(tfidf_sims[inputindex][matchedsessindex] * 100, 2) > 20.0):
            rowdict = {}
            rowdict["OriginalSession"] = pdDocuments.loc[inputindex].SessionId
            rowdict["SimilarSession"] = pdDocuments.loc[matchedsessindex].SessionId
            rowdict["SimilarityMetric"] = round(tfidf_sims[inputindex][matchedsessindex] * 100, 2)
            # this graph is effectively an undirected graph, so only add a new row
            # if there is no prior edge connecting these 2 sessions
            prioredgeexists = False
            for priorrow in similaritylist:
                if (priorrow["SimilarSession"] == rowdict["OriginalSession"]
                        and priorrow["OriginalSession"] == rowdict["SimilarSession"]):
                    prioredgeexists = True
            if (not prioredgeexists):
                similaritylist.append(rowdict)
            print(str(matchedsessindex) + ": " + pdDocuments.loc[matchedsessindex]["Title"]
                  + " (" + str(round(tfidf_sims[inputindex][matchedsessindex] * 100, 2)) + "% similar)")

for sessid in range(len(pdDocuments)):
    similarsessions(sessid)

print(len(similaritylist))

finalresultDF = pd.DataFrame(similaritylist)
# rename the DF columns to match the internal edge column names in the graph table
finalresultDF.rename(columns = {"OriginalSession": "$from_id_C19A274BF63B41359AD62328FD4E987D",
                                "SimilarSession": "$to_id_464CF6F8A8A1406B914D18B5010D7CB1"},
                     inplace = True)

sqlDS = RxSqlServerData(
    connection_string = "Driver=ODBC Driver 13 for SQL Server;Server=.\SQL2017;Database=PASS-Demo;trusted_connection=YES",
    table = "dbo.SimilarSessions")
rx_data_step(finalresultDF, output_file = sqlDS, append = ["rows"])
',
    @input_data_1 = N'SELECT CAST((ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) - 1 AS INT) AS RowIndex,
        Abstract, SessionLevel, Speaker1, Speaker2, Speaker3, Title, Track, $node_id AS SessionId
    FROM Session'
Once the above code is executed, the SimilarSessions table is populated with edges! Then we can query that table using regular T-SQL and the new MATCH predicate in SQL Graph. For example below we look at sessions similar to my colleague Denzil's session:
SELECT TS.SessionId, TS.Title, SimilarityMetric
FROM SimilarSessions SS, [Session] OS, [Session] TS
WHERE MATCH(OS-(SS)->TS) AND OS.SessionId = 69503
UNION ALL
SELECT OS.SessionId, OS.Title, SimilarityMetric
FROM SimilarSessions SS, [Session] OS, [Session] TS
WHERE MATCH(OS-(SS)->TS) AND TS.SessionId = 69503
Here is the output of that query:
I'm sure you will agree that, looking at the above, these are closely related sessions and would make great recommendations for anyone already viewing Denzil's session!
Visualization - GraphML
Now, the last part is how to visualize the above graph in some capable tool. SQL does not ship with native visualization for graphs, and the main reason for this is that preferences on the visualization are hugely varied and we do not want to enforce anything specific from our side. Instead, we recommend using standard tools like d3.js, Gephi etc. In my case, I chose to use a very powerful tool called Cytoscape. Now, many of these tools understand a standard format for representing graphs, called GraphML. This format is XML and hence it is easy to use T-SQL to generate GraphML corresponding to our graph! Here's the code to do this:
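As an aside, if you would rather assemble the GraphML outside T-SQL, libraries like networkx can emit it directly from edge rows exported from SQL. A small sketch (the session IDs, title and weights below are made up; networkx is not used in the T-SQL approach shown next):

```python
import networkx as nx

# (from_session, to_session, similarity %) rows, e.g. queried out of the
# SimilarSessions edge table -- illustrative values only.
edges = [(69503, 69504, 35.2), (69503, 69510, 21.7)]

G = nx.Graph()   # undirected, matching edgedefault="undirected" in the GraphML
for f, t, sim in edges:
    G.add_edge(f, t, weight=sim)

# node attributes become <data> elements keyed per attribute name
G.nodes[69503]["Title"] = "Example session title"

nx.write_graphml(G, "PASS2017.graphml")
```

The resulting file imports into Cytoscape or Gephi just like the hand-built XML.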
CREATE OR ALTER PROCEDURE CreateGraphML
AS
BEGIN
    DECLARE @prolog AS NVARCHAR (MAX) = N'<?xml version=''1.0'' encoding=''utf-8''?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key attr.name="weight" attr.type="long" for="edge" id="d3" />
  <key attr.name="SessionId" attr.type="string" for="node" id="d2" />
  <key attr.name="Track" attr.type="string" for="node" id="d1" />
  <key attr.name="Title" attr.type="string" for="node" id="d0" />
  <graph edgedefault="undirected">
';
    DECLARE @epilog AS NVARCHAR (MAX) = N'  </graph>
</graphml>
';
    DECLARE @nodeXML AS NVARCHAR (MAX) = (
        SELECT *
        FROM (SELECT 1 AS Tag,
                     0 AS Parent,
                     S.SessionId AS [node!1!id],
                     NULL AS [data!2!!element],
                     NULL AS [data!2!key]
              FROM dbo.[Session] AS S
              UNION ALL
              SELECT 2 AS Tag,
                     1 AS Parent,
                     S.SessionId,
                     CONCAT(S.Title, CHAR(13), CHAR(10),
                            CONCAT('(', S.Speaker1,
                                   IIF (S.Speaker2 IS NULL, '', CONCAT(',', Speaker2)),
                                   IIF (S.Speaker3 IS NULL, '', CONCAT(',', Speaker3)), ')'),
                            ' [',
                            ((SELECT COUNT(*) FROM SimilarSessions AS SS WHERE SS.$FROM_ID = S.$NODE_ID)
                             + (SELECT COUNT(*) FROM SimilarSessions AS SS WHERE SS.$TO_ID = S.$NODE_ID)),
                            ' connections]'),
                     'd0'
              FROM dbo.[Session] AS S
              UNION ALL
              SELECT 2 AS Tag, 1 AS Parent, S.SessionId, S.Track, 'd1'
              FROM dbo.[Session] AS S
              UNION ALL
              SELECT 2 AS Tag, 1 AS Parent, S.SessionId, CAST (S.SessionId AS NVARCHAR (200)), 'd2'
              FROM dbo.[Session] AS S) AS InnerTable
        ORDER BY [node!1!id], [data!2!!element]
        FOR XML EXPLICIT);
    DECLARE @edgeXML AS NVARCHAR (MAX);
    WITH Edges
    AS (SELECT OS.SessionId AS source,
               TS.SessionId AS target,
               CAST (SS.SimilarityMetric AS INT) AS data
        FROM SimilarSessions AS SS, [Session] AS OS, [Session] AS TS
        WHERE MATCH(OS-(SS)->TS))
    SELECT @edgeXML = (
        SELECT *
        FROM (SELECT 1 AS Tag, 0 AS Parent, source AS [edge!1!source], target AS [edge!1!target],
                     NULL AS [data!2!!element], NULL AS [data!2!key]
              FROM Edges
              UNION ALL
              SELECT 2 AS Tag, 1 AS Parent, source, target, data, 'd3'
              FROM Edges) AS InnerTable
        ORDER BY [edge!1!source], [edge!1!target], [data!2!!element]
        FOR XML EXPLICIT);
    SELECT CONCAT(@prolog, @nodeXML, @edgeXML, @epilog);
END
GO

EXEC CreateGraphML;
GO

/* Run from CMD prompt:
   bcp "EXEC CreateGraphML" queryout PASS2017.xml -T -S .\SQL2017 -dPASS-Demo -C65001 -c
*/
And that's it! When you run the BCP command line from CMD prompt, it will create a PASS2017.xml file, which is internally in the GraphML format. That's easily imported into Cytoscape or other such graph visualization tools. And that is how we created the fun visualization that you saw in the "sneak preview" blog post!
Disclaimer