Knowledge graph in 100 lines of code

Finished knowledge graph viewed in the Neo4j database

In this article I am going to show you how easy it is to create your own custom knowledge graph using TypeScript, open-source tools, and the Neo4j database.

Knowledge graphs are getting lots of attention at the moment, as they are the natural Yin to the Yang of LLMs, providing structured data to chat interfaces and powering Retrieval Augmented Generation.

Goals

The goal is to create a knowledge graph, stored in a graph database, and to ensure that the data within the graph adheres to a data schema. After all, if we put garbage into the graph, we are likely to get garbage out!

The graph should have indexes on key properties to improve query performance. We also want vector indexes on selected textual content, populated with LLM-generated embeddings, so that nodes can be retrieved by semantic (conceptual) search rather than by exact text match.
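To make the idea of semantic search concrete, here is a minimal sketch of cosine similarity, the measure this article later selects for its vector index. The function name and the toy three-dimensional vectors are made up for illustration; real embeddings have many more dimensions, and the database computes the score for you.

```typescript
// Cosine similarity between two equal-length vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings", invented for this example.
const queryVector = [1, 0, 1];
const similarVector = [2, 0, 2];   // points in the same direction as the query
const unrelatedVector = [0, 1, 0]; // orthogonal to the query

console.log(cosineSimilarity(queryVector, similarVector));   // close to 1
console.log(cosineSimilarity(queryVector, unrelatedVector)); // 0
```

Vectors that point in the same direction score close to 1 regardless of their length, which is why the measure suits comparing the "direction" of meaning of two texts.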

Finally, we want the overall project to be small, lightweight and easy to modify.

In a follow-up article I plan to extend the project to show how to use the graph for Retrieval Augmented Generation.

Infrastructure

We will use Neo4j as the graph database, Concerto as the data schema language, TypeScript to interact with the graph via an API, and OpenAI to calculate vector embeddings for text.

Create an AuraDB account and provision a free-tier database. Alternatively, if you don’t want to store your data in the cloud, you can run the Neo4j Community Docker image locally. Save the URL, username and password for your database.
Create an OpenAI developer account and save your API key (optional).

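Store the above settings in environment variables, exported to your shell. The demo code also reads a NEO4J_USER variable for the username, although it only checks that the URL and password are set: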

export NEO4J_URL=bolt+s://<AURA_DB_URL>
export NEO4J_PASS=<NEO4J_PASSWORD>
export OPENAI_API_KEY=<OPENAI API KEY>

Note: Without an OpenAI API key, the graph will still be created, but vector embeddings for its text content will not be calculated.
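The demo code later in this article calls a checkEnv helper that is not shown. The following is a minimal sketch of what such a helper might look like, assuming it simply fails fast when a required variable is missing. The second parameter is an addition made here so the example runs without touching your real environment; the demo's own helper takes only the variable name and may behave differently.

```typescript
// Hypothetical helper: throws if the named variable is missing or blank.
// This is an illustration, not the implementation used by the demo project.
type Env = Record<string, string | undefined>;

function checkEnv(name: string, env: Env): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Environment variable ${name} is not set`);
  }
  return value;
}

// Stand-in environment object; the URL is a placeholder, not a real database.
const exampleEnv: Env = { NEO4J_URL: 'bolt+s://example.invalid', NEO4J_PASS: '' };

console.log(checkEnv('NEO4J_URL', exampleEnv));
try {
  checkEnv('NEO4J_PASS', exampleEnv);
} catch (err) {
  console.log((err as Error).message);
}
```

In a Node.js program you would pass process.env as the second argument.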

Project

The source code for this project is here: https://github.com/dselman/movie-graph

Define Data Model

Before defining a graph, we define the data model using the Concerto data modelling language. The model below defines nodes and edges for a simple Movie database. It includes nodes for movies, actors, directors and users who have rated movies. Movies have a textual summary that will be indexed for vector search.

const NS = 'demo.graph@1.0.0';
 
const MODEL = `
namespace ${NS}
import org.accordproject.graph@1.0.0.{GraphNode}
 
concept Address {
  o String line1
  o String line2 optional
  o String city
  o String state
  o String zip optional
  o String country
}
 
// show how a complex type gets flattened
concept ContactDetails {
  o Address address
  o String email
}
 
// show how maps get flattened
scalar PersonName extends String
scalar PersonEmail extends String
map AddressBook {
  o PersonName
  o PersonEmail
}
 
concept Person extends GraphNode {
  @label("ACTED_IN")
  --> Movie[] actedIn optional
  @label("DIRECTED")
  --> Movie[] directed optional
}
 
concept Actor extends Person {
}
 
concept Director extends Person {
}
 
concept User extends Person {
  o ContactDetails contactDetails
  o AddressBook addressBook
  @label("RATED")
  --> Movie[] ratedMovies optional
}
 
concept Genre extends GraphNode {
}
 
concept Movie extends GraphNode {
  o Double[] embedding optional
  @vector_index("embedding", 1536, "COSINE")
  o String summary optional
  @label("IN_GENRE")
  --> Genre[] genres optional
}
`;
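The comments in the model mention that complex types and maps get flattened. Neo4j node properties can only hold primitive values or arrays of primitives, so a nested object such as ContactDetails has to be spread across several simple properties. The sketch below illustrates the general idea; the underscore separator, the skipping of $class, and the function name are assumptions made for this example and may not match the property names the library actually produces.

```typescript
type Primitive = string | number | boolean;
type Nested = { [key: string]: Primitive | Primitive[] | Nested };
type Flat = Record<string, Primitive | Primitive[]>;

// Flattens nested objects into a single-level object whose keys join the
// path segments with '_'. Separator and $class handling are assumptions.
function flatten(obj: Nested, prefix = '', out: Flat = {}): Flat {
  for (const [key, value] of Object.entries(obj)) {
    if (key === '$class') continue; // type marker, not data
    const path = prefix ? `${prefix}_${key}` : key;
    if (Array.isArray(value) || typeof value !== 'object') {
      out[path] = value;          // primitive or array of primitives: store as-is
    } else {
      flatten(value, path, out);  // nested object or map: recurse
    }
  }
  return out;
}

const exampleUser = {
  contactDetails: {
    $class: 'demo.graph@1.0.0.ContactDetails',
    address: { $class: 'demo.graph@1.0.0.Address', line1: '1 Main Street', city: 'Boulder' },
    email: 'dan@example.com',
  },
  addressBook: { Dan: 'dan@example.com' },
};

console.log(flatten(exampleUser));
```

Under these assumptions the nested street line ends up in a property named contactDetails_address_line1, and each map entry becomes its own property.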

Once the data model has been defined, use the GraphModel class to interact with the graph: create constraints and indexes, then merge nodes and the relationships between them into the graph:

async function run() {
  checkEnv('NEO4J_PASS');
  checkEnv('NEO4J_URL');
 
  const options:GraphModelOptions = {
    NEO4J_USER: process.env.NEO4J_USER,
    NEO4J_PASS: process.env.NEO4J_PASS,
    NEO4J_URL: process.env.NEO4J_URL,
    logger: console,
    logQueries: false,
    embeddingFunction: process.env.OPENAI_API_KEY ? getOpenAiEmbedding : undefined
  }
  const graphModel = new GraphModel([MODEL], options);
  await graphModel.connect();
  // Rebuild indexes and constraints from the model before loading data
  await graphModel.dropIndexes();
  await graphModel.createConstraints();
  await graphModel.createVectorIndexes();
  const context = await graphModel.openSession();
 
  const { session } = context;
  await session.executeWrite(async transaction => {
    const address = {
      $class: 'demo.graph@1.0.0.Address',
      line1: '1 Main Street',
      city: 'Boulder',
      state: 'CO',
      country: 'USA'
    };
    const contactDetails = {
      $class: 'demo.graph@1.0.0.ContactDetails',
      address,
      email: 'dan@example.com'
    };
    const addressBook = {
      'Dan' : 'dan@example.com',
      'Isaac' : 'isaac@example.com'
    };
    await graphModel.mergeNode(transaction, 'Movie', {identifier: 'Brazil', summary: 'The film centres on Sam Lowry, a low-ranking bureaucrat trying to find a woman who appears in his dreams while he is working in a mind-numbing job and living in a small apartment, set in a dystopian world in which there is an over-reliance on poorly maintained (and rather whimsical) machines'} );
    await graphModel.mergeNode(transaction, 'Movie', {identifier: 'The Man Who Killed Don Quixote', summary: 'Instead of a literal adaptation, Gilliam\'s film was about "an old, retired, and slightly kooky nobleman named Alonso Quixano".'} );
    await graphModel.mergeNode(transaction, 'Movie', {identifier: 'Fear and Loathing in Las Vegas', summary: 'Duke, under the influence of mescaline, complains of a swarm of giant bats, and inventories their drug stash. They pick up a young hitchhiker and explain their mission: Duke has been assigned by a magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive drugs for the trip, and rented a red Chevrolet Impala convertible.'} );
 
    await graphModel.mergeNode(transaction, 'Genre', {identifier: 'Comedy'} );
    await graphModel.mergeNode(transaction, 'Genre', {identifier: 'Science Fiction'} );
 
    await graphModel.mergeRelationship(transaction, 'Movie', 'Brazil', 'Genre', 'Comedy', 'genres' );
    await graphModel.mergeRelationship(transaction, 'Movie', 'Brazil', 'Genre', 'Science Fiction', 'genres' );
    await graphModel.mergeRelationship(transaction, 'Movie', 'The Man Who Killed Don Quixote', 'Genre', 'Comedy', 'genres' );
    await graphModel.mergeRelationship(transaction, 'Movie', 'Fear and Loathing in Las Vegas', 'Genre', 'Comedy', 'genres' );
 
    await graphModel.mergeNode(transaction, 'Director', {identifier: 'Terry Gilliam'} );
    await graphModel.mergeRelationship(transaction, 'Director', 'Terry Gilliam', 'Movie', 'Brazil', 'directed' );
    await graphModel.mergeRelationship(transaction, 'Director', 'Terry Gilliam', 'Movie', 'The Man Who Killed Don Quixote', 'directed' );
    await graphModel.mergeRelationship(transaction, 'Director', 'Terry Gilliam', 'Movie', 'Fear and Loathing in Las Vegas', 'directed' );
 
    await graphModel.mergeNode(transaction, 'User', {identifier: 'Dan', contactDetails, addressBook} );
    await graphModel.mergeRelationship(transaction, 'User', 'Dan', 'Movie', 'Brazil', 'ratedMovies' );
     
    await graphModel.mergeNode(transaction, 'Actor', {identifier: 'Jonathan Pryce'} );
    await graphModel.mergeRelationship(transaction, 'Actor', 'Jonathan Pryce', 'Movie', 'Brazil', 'actedIn' );
    await graphModel.mergeRelationship(transaction, 'Actor', 'Jonathan Pryce', 'Movie', 'The Man Who Killed Don Quixote', 'actedIn' );
 
    await graphModel.mergeNode(transaction, 'Actor', {identifier: 'Johnny Depp'} );
    await graphModel.mergeRelationship(transaction, 'Actor', 'Johnny Depp', 'Movie', 'Fear and Loathing in Las Vegas', 'actedIn' );
    console.log('Created graph...');
  });
  await graphModel.closeSession(context);
  console.log('done');
}
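The calls above merge rather than create, so re-running the script should not duplicate nodes. In Cypher this is what the MERGE clause provides: it matches an existing pattern or creates it if absent. The sketch below builds such a statement for a node keyed on identifier. It is an assumption about the kind of query involved, not the query the GraphModel class actually generates, and the function and interface names are hypothetical.

```typescript
interface CypherStatement {
  query: string;
  params: Record<string, unknown>;
}

// Builds a parameterised MERGE statement for a node identified by `identifier`.
// Labels cannot be passed as ordinary query parameters, so the label is
// validated before being interpolated into the query text.
function buildMergeNode(
  label: string,
  properties: { identifier: string; [key: string]: unknown }
): CypherStatement {
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(label)) {
    throw new Error(`Invalid label: ${label}`);
  }
  const { identifier, ...rest } = properties;
  return {
    // Match or create the node by identifier, then add the remaining properties.
    query: `MERGE (n:${label} {identifier: $identifier}) SET n += $properties`,
    params: { identifier, properties: rest },
  };
}

const statement = buildMergeNode('Genre', { identifier: 'Comedy' });
console.log(statement.query);
console.log(statement.params);
```

Keeping all values in parameters rather than in the query text avoids quoting problems with strings such as movie summaries.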

Note that if the OPENAI_API_KEY environment variable is set, String properties that carry the @vector_index decorator will automatically have their vector embeddings calculated. To perform a semantic similarity search across nodes, use the graphModel.similarityQuery method.

The code below searches the graph for the top 3 Movie nodes with a summary that is semantically similar to ‘Working in a boring job and looking for love.’

if (process.env.OPENAI_API_KEY) {
  const search = 'Working in a boring job and looking for love.';
  console.log(`Searching for movies related to: '${search}'`);
  const results = await graphModel.similarityQuery('Movie', 'summary', search, 3);
  console.log(results);
}
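A top-k query always returns up to k results, however weak the match, so in practice you may want to discard low-scoring hits. The sketch below assumes the result shape visible in the sample output later in this article (identifier, content, score); the interface name, the helper and the 0.88 threshold are made up for illustration, and the content strings are abbreviated.

```typescript
// Assumed result shape, inferred from the printed output rather than from the library's types.
interface SimilarityResult {
  identifier: string;
  content: string;
  score: number;
}

// Keeps results at or above minScore, highest score first.
// filter() returns a new array, so the input is not mutated by sort().
function filterByScore(results: SimilarityResult[], minScore: number): SimilarityResult[] {
  return results
    .filter(result => result.score >= minScore)
    .sort((a, b) => b.score - a.score);
}

const sampleResults: SimilarityResult[] = [
  { identifier: 'Brazil', content: 'The film centres on Sam Lowry...', score: 0.9018 },
  { identifier: 'The Man Who Killed Don Quixote', content: 'Instead of a literal adaptation...', score: 0.8656 },
  { identifier: 'Fear and Loathing in Las Vegas', content: 'Duke, under the influence of mescaline...', score: 0.864 },
];

// The threshold is arbitrary; a sensible value depends on your data and embedding model.
console.log(filterByScore(sampleResults, 0.88).map(result => result.identifier));
```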

Running the demo code using npm start should result in output similar to that below:

Connection established
{"address":"a15f52e2.databases.neo4j.io:7687","agent":"Neo4j/5.15-aura","protocolVersion":5.4}
Dropping indexes...
Drop indexes completed
Creating constraints...
Create constraints completed
Creating vector indexes...
Create vector indexes completed
EmbeddingCacheNode cache miss
Created cache node 8e40cf4e18bb0f86ca3c72f6612e315cb1a1c56f1e1991b97c72537e28b9b668
EmbeddingCacheNode cache miss
Created cache node 0b273d35894fb129ab8f2fd3d6ba38d0e8dcae669337c692166c0cf110721091
EmbeddingCacheNode cache miss
Created cache node 9da6b1d987fad60866cf18fceb0c0dce902cdc0045e65a26a2ca80c30133cc8c
Created graph...
Searching for movies related to: 'Working in a boring job and looking for love.'
EmbeddingCacheNode cache miss
Created cache node 16870ade725f46a2e205917f09123567bfa483f6b4cabcb503e4901f47eb210d
[
  {
    identifier: 'Brazil',
    content: 'The film centres on Sam Lowry, a low-ranking bureaucrat trying to find a woman who appears in his dreams while he is working in a mind-numbing job and living in a small apartment, set in a dystopian world in which there is an over-reliance on poorly maintained (and rather whimsical) machines',
    score: 0.9018157720565796
  },
  {
    identifier: 'The Man Who Killed Don Quixote',
    content: `Instead of a literal adaptation, Gilliam's film was about "an old, retired, and slightly kooky nobleman named Alonso Quixano".`,
    score: 0.8655685782432556
  },
  {
    identifier: 'Fear and Loathing in Las Vegas',
    content: 'Duke, under the influence of mescaline, complains of a swarm of giant bats, and inventories their drug stash. They pick up a young hitchhiker and explain their mission: Duke has been assigned by a magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive drugs for the trip, and rented a red Chevrolet Impala convertible.',
    score: 0.8640466928482056
  }
]
done

Vector search returned all three movies in the graph, ranked by similarity score. Brazil scores highest (about 0.90, against roughly 0.86 to 0.87 for the other two), even though the exact text of the query ‘Working in a boring job and looking for love.’ does not appear in its summary.
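The ‘EmbeddingCacheNode cache miss’ lines in the log suggest that embeddings are cached, so that the same text is not sent to the embedding service twice. The library appears to store its cache as nodes in the graph; the sketch below shows the same idea with a plain in-memory Map instead. All names here are hypothetical, and the fake embedding function stands in for a real API call.

```typescript
type EmbeddingFunction = (text: string) => Promise<number[]>;

// Wraps an embedding function with an in-memory cache keyed by the input text.
// Caching the promise (not just the result) also de-duplicates concurrent calls.
function withCache(embed: EmbeddingFunction): EmbeddingFunction {
  const cache = new Map<string, Promise<number[]>>();
  return (text: string) => {
    const cached = cache.get(text);
    if (cached) {
      return cached; // cache hit: reuse the earlier request
    }
    const pending = embed(text); // cache miss: call the underlying function
    cache.set(text, pending);
    // Forget failed requests so that they can be retried later.
    pending.catch(() => cache.delete(text));
    return pending;
  };
}

// Stand-in for a real embedding call; counts how often it is invoked.
let calls = 0;
const fakeEmbedding: EmbeddingFunction = async (text) => {
  calls++;
  return [text.length, 0, 0];
};

const cachedEmbedding = withCache(fakeEmbedding);
const first = cachedEmbedding('Working in a boring job');
const second = cachedEmbedding('Working in a boring job');
console.log(first === second, calls); // true 1
```

A persistent cache, like the one the log hints at, has the further advantage that embeddings survive between runs of the script.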

You can use the AuraDB console to view the graph.

Author

Dan Selman

Distinguished Engineer, Smart Agreements

Published

May 3, 2024
