Load Data Into Neo4j

praveenr - Aug 18 - - Dev Community

In the previous blog we saw how to install and set up Neo4j locally with two plugins: APOC and the Graph Data Science Library (GDS). In this blog I am going to take a toy dataset (products on an e-commerce website) and store it in Neo4j.

 

Allocating Sufficient Memory For Neo4j

Before loading the data, make sure enough memory is allocated to Neo4j, especially if your use case involves a large dataset. To do that:

  • Click on the three dots to the right of open

Three dots

  • Click on Open folder -> Configuration

Configuration

  • Click on neo4j.conf

neo4j conf

  • Search for heap in neo4j.conf, uncomment the two heap size settings (lines 77 and 78 in my file) and change 256m to 2048m. This allocates 2048 MB to Neo4j's JVM heap, the memory it uses while running queries and transactions.

Memory

 
 

Creating Nodes

  • Graphs have two primary components: nodes and relationships. Let's create the nodes first and establish the relationships later.

  • The data I am using is present here - data

  • Use the requirements.txt present here to create a python virtual environment - requirements.txt

  • Let's define various functions to push data.

  • Importing necessary libraries

import pandas as pd
from neo4j import GraphDatabase
from openai import OpenAI
  • We are going to use OpenAI to generate embeddings. Fill in your own API key below.
client = OpenAI(api_key="")
product_data_df = pd.read_csv('../data/product_data.csv')
  • To generate embeddings
def get_embedding(text):
    """
    Used to generate embeddings using OpenAI embeddings model
    :param text: str - text that needs to be converted to embeddings
    :return: embedding
    """
    model = "text-embedding-3-small"
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding
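  • Calling the API once per text is fine for a toy dataset. The embeddings endpoint also accepts a list of inputs, so several texts can be embedded in one request. Below is a minimal sketch of that idea. The helper name get_embeddings_batch and the batch_size default are my own assumptions, not part of the original code, and the client is passed in as an argument.

```python
def get_embeddings_batch(client, texts, model="text-embedding-3-small", batch_size=100):
    """
    Embed several texts per request instead of one request per text.
    `client` is expected to behave like openai.OpenAI(); the helper name and
    `batch_size` are illustrative assumptions, not part of the original code.
    :param client: OpenAI-style client exposing embeddings.create
    :param texts: list of str - texts to embed
    :return: list of embeddings, in the same order as `texts`
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        # Same newline cleanup as get_embedding
        batch = [text.replace("\n", " ") for text in texts[start:start + batch_size]]
        response = client.embeddings.create(input=batch, model=model)
        # Sort by index so the output order matches the input order
        ordered = sorted(response.data, key=lambda item: item.index)
        embeddings.extend(item.embedding for item in ordered)
    return embeddings
```

With the client defined above, this could be called as get_embeddings_batch(client, list(product_data_df['Category'].unique())).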
  • Our dataset gives us two node labels: Product_type (the type/category of a product) and Product_details (the product itself). Let's create the category nodes first. Neo4j lets you attach properties to a node; you can think of them as metadata for that node. Here name and embedding are the properties, so we store the name of each category and its corresponding embedding in the DB.
def create_product_type(product_data_df):
    """
    Used to generate queries for creating product type nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for category
    """
    cat_query = """CREATE (a:Product_type {name: '%s', embedding: %s})"""
    distinct_product_types = product_data_df['Category'].unique()
    query_list = []
    for type_ in distinct_product_types:
        embedding = get_embedding(type_)
        query_list.append(cat_query % (type_, embedding))
    return query_list
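  • Building the Cypher text with % formatting breaks when a name contains a single quote, which appears to be why the complete code strips apostrophes during preprocessing. The Neo4j Python driver also accepts query parameters ($name style), which keeps the values out of the query text. Here is a rough sketch of that alternative. The function name build_product_type_params is hypothetical, and the embedding function is passed in so the sketch does not depend on OpenAI.

```python
def build_product_type_params(categories, embed):
    """
    Build (query, parameters) pairs for Product_type nodes.
    Hypothetical alternative to create_product_type: values travel as
    parameters, so quotes in a name need no escaping.
    :param categories: iterable of str - category names (duplicates allowed)
    :param embed: callable - turns a str into an embedding (e.g. get_embedding)
    :return: list of (query, parameters) tuples
    """
    query = "CREATE (a:Product_type {name: $name, embedding: $embedding})"
    pairs = []
    seen = set()
    for category in categories:
        # Keep only the first occurrence of each category
        if category in seen:
            continue
        seen.add(category)
        pairs.append((query, {"name": category, "embedding": embed(category)}))
    return pairs
```

For this dataset the call would look like build_product_type_params(product_data_df['Category'], get_embedding), and each pair can then be run with session.run(query, parameters).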
  • Similarly we can create Product_details nodes, here the properties would be name, description, price, warranty_period, available_stock, review_rating, product_release_date, embedding
def create_product_details(product_data_df):
    """
    Used to generate queries for creating product_details nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for product
    """
    product_query = """CREATE (a:Product_details {name: '%s', description: '%s', price: %d, warranty_period: %d, 
    available_stock: %d, review_rating: %f, product_release_date: date('%s'), embedding: %s})"""
    query_list = []
    for idx, row in product_data_df.iterrows():
        embedding = get_embedding(row['Product Name'] + " - " + row['Description'])
        query_list.append(product_query % (row['Product Name'], row['Description'], int(row['Price (INR)']),
                                           int(row['Warranty Period (Years)']), int(row['Stock']),
                                           float(row['Review Rating']), str(row['Product Release Date']), embedding))
    return query_list
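  • The same parameter approach can carry typed values for Product_details. The sketch below converts one row into a parameter dictionary. It assumes the release date column holds ISO dates (YYYY-MM-DD), which the date('%s') call above also relies on. PRODUCT_QUERY and build_product_params are names I made up for illustration.

```python
from datetime import date

# Parameterized version of the Product_details CREATE query (illustrative)
PRODUCT_QUERY = """
CREATE (a:Product_details {
    name: $name, description: $description, price: $price,
    warranty_period: $warranty_period, available_stock: $available_stock,
    review_rating: $review_rating, product_release_date: $product_release_date,
    embedding: $embedding
})
"""


def build_product_params(row, embed):
    """
    Convert one data row into parameters for PRODUCT_QUERY.
    :param row: dict-like - one row of the product data
    :param embed: callable - turns a str into an embedding (e.g. get_embedding)
    :return: dict - query parameters
    """
    return {
        "name": row["Product Name"],
        "description": row["Description"],
        "price": int(row["Price (INR)"]),
        "warranty_period": int(row["Warranty Period (Years)"]),
        "available_stock": int(row["Stock"]),
        "review_rating": float(row["Review Rating"]),
        # Assumes an ISO date string; the driver maps datetime.date to a Cypher date
        "product_release_date": date.fromisoformat(str(row["Product Release Date"])),
        "embedding": embed(row["Product Name"] + " - " + row["Description"]),
    }
```

Each row from product_data_df.iterrows() can be passed in directly, and the result run with session.run(PRODUCT_QUERY, params).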
  • Now let's create another function to execute the queries generated by the above 2 functions. Update your username and password appropriately.
def execute_bulk_query(query_list):
    """
    Executes queries in a list one by one
    :param query_list: list - list of cypher queries
    :return: None
    """
    url = "bolt://localhost:7687"
    auth = ("neo4j", "neo4j@123")

    with GraphDatabase.driver(url, auth=auth) as driver:
        with driver.session() as session:
            for query in query_list:
                try:
                    session.run(query)
                except Exception as error:
                    print(f"Error in executing query - {query}, Error - {error}")
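  • The function above prints a failing query and moves on. If you would rather collect the failures and pass parameters separately from the query text, something like the following could work. This is only a sketch: execute_parameterized is a hypothetical helper, and it expects an already open session rather than creating the driver itself.

```python
def execute_parameterized(session, query_param_pairs):
    """
    Run (query, parameters) pairs on an open neo4j session and collect failures.
    Hypothetical variant of execute_bulk_query, not part of the original code.
    :param session: open neo4j session (or any object with a compatible run method)
    :param query_param_pairs: list of (query, parameters) tuples
    :return: list of (query, error message) tuples for the queries that failed
    """
    failures = []
    for query, params in query_param_pairs:
        try:
            # consume() waits for the query to finish so that errors surface here
            session.run(query, params).consume()
        except Exception as error:
            failures.append((query, str(error)))
    return failures


# Usage with the driver, mirroring execute_bulk_query:
# with GraphDatabase.driver(url, auth=auth) as driver:
#     with driver.session() as session:
#         failed = execute_parameterized(session, pairs)
```

Returning the failures makes it easy to retry or inspect them after the whole list has been processed.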
  • Complete code
import pandas as pd
from neo4j import GraphDatabase
from openai import OpenAI

client = OpenAI(api_key="")
product_data_df = pd.read_csv('../data/product_data.csv')


def preprocessing(df, columns_to_replace):
    """
    Used to preprocess certain column in dataframe
    :param df: pandas dataframe - data
    :param columns_to_replace: list - column name list
    :return: df: pandas dataframe - processed data
    """
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'s", "s"))
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'", ""))
    return df


def get_embedding(text):
    """
    Used to generate embeddings using OpenAI embeddings model
    :param text: str - text that needs to be converted to embeddings
    :return: embedding
    """
    model = "text-embedding-3-small"
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model=model).data[0].embedding


def create_product_type(product_data_df):
    """
    Used to generate queries for creating product type nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for category
    """
    cat_query = """CREATE (a:Product_type {name: '%s', embedding: %s})"""
    distinct_product_types = product_data_df['Category'].unique()
    query_list = []
    for type_ in distinct_product_types:
        embedding = get_embedding(type_)
        query_list.append(cat_query % (type_, embedding))
    return query_list


def create_product_details(product_data_df):
    """
    Used to generate queries for creating product_details nodes in neo4j
    :param product_data_df: pandas dataframe - data
    :return: query_list: list - list containing all create node queries for product
    """
    product_query = """CREATE (a:Product_details {name: '%s', description: '%s', price: %d, warranty_period: %d, 
    available_stock: %d, review_rating: %f, product_release_date: date('%s'), embedding: %s})"""
    query_list = []
    for idx, row in product_data_df.iterrows():
        embedding = get_embedding(row['Product Name'] + " - " + row['Description'])
        query_list.append(product_query % (row['Product Name'], row['Description'], int(row['Price (INR)']),
                                           int(row['Warranty Period (Years)']), int(row['Stock']),
                                           float(row['Review Rating']), str(row['Product Release Date']), embedding))
    return query_list


def execute_bulk_query(query_list):
    """
    Executes queries in a list one by one
    :param query_list: list - list of cypher queries
    :return: None
    """
    url = "bolt://localhost:7687"
    auth = ("neo4j", "neo4j@123")

    with GraphDatabase.driver(url, auth=auth) as driver:
        with driver.session() as session:
            for query in query_list:
                try:
                    session.run(query)
                except Exception as error:
                    print(f"Error in executing query - {query}, Error - {error}")


# PREPROCESSING
product_data_df = preprocessing(product_data_df, ['Product Name', 'Description'])

# CREATE PRODUCT TYPE
query_list = create_product_type(product_data_df)
execute_bulk_query(query_list)

# CREATE PRODUCT DETAIL
query_list = create_product_details(product_data_df)
execute_bulk_query(query_list)



 
 

Creating Relationships

  • We are going to create a relationship from each Product_type node to the Product_details nodes that belong to it. The relationship is named CONTAINS.
from neo4j import GraphDatabase
import pandas as pd

product_data_df = pd.read_csv('../data/product_data.csv')


def preprocessing(df, columns_to_replace):
    """
    Used to preprocess certain column in dataframe
    :param df: pandas dataframe - data
    :param columns_to_replace: list - column name list
    :return: df: pandas dataframe - processed data
    """
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'s", "s"))
    df[columns_to_replace] = df[columns_to_replace].apply(lambda col: col.str.replace("'", ""))
    return df


def create_type_detail_relationship_query(product_data_df):
    """
    Used to create relationship between Product_type and Product_details
    :param product_data_df: dataframe - data
    :return: query_list: list - cypher queries
    """
    query = """MATCH (c:Product_type {name: '%s'}), (p:Product_details {name: '%s'}) CREATE (c)-[:CONTAINS]->(p)"""
    query_list = []
    for idx, row in product_data_df.iterrows():
        query_list.append(query % (row['Category'], row['Product Name']))
    return query_list


def execute_bulk_query(query_list):
    """
    Executes queries in a list one by one
    :param query_list: list - list of cypher queries
    :return: None
    """
    url = "bolt://localhost:7687"
    auth = ("neo4j", "neo4j@123")

    with GraphDatabase.driver(url, auth=auth) as driver:
        with driver.session() as session:
            for query in query_list:
                try:
                    session.run(query)
                except Exception as error:
                    print(f"Error in executing query - {query}, Error - {error}")


# PREPROCESSING
product_data_df = preprocessing(product_data_df, ['Product Name', 'Description'])

# PRODUCT TYPE - PRODUCT DETAILS RELATIONSHIP
query_list = create_type_detail_relationship_query(product_data_df)
execute_bulk_query(query_list)

  • The MATCH clause finds the nodes we already created, and CREATE then establishes the relationship between them.
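  • One thing to watch: CREATE adds a new relationship every time it runs, so running the script twice leaves two CONTAINS relationships per product. A possible variation, sketched below, uses MERGE (which only creates the relationship if it does not exist) and sends all rows in one query through UNWIND. RELATIONSHIP_QUERY and build_relationship_rows are names I introduced for this sketch.

```python
# One query for all rows: UNWIND iterates over the $rows parameter,
# MERGE creates the relationship only when it is missing
RELATIONSHIP_QUERY = """
UNWIND $rows AS row
MATCH (c:Product_type {name: row.category})
MATCH (p:Product_details {name: row.product})
MERGE (c)-[:CONTAINS]->(p)
"""


def build_relationship_rows(records):
    """
    Reduce data rows to the two fields the relationship query needs.
    :param records: iterable of dict-like rows with 'Category' and 'Product Name'
    :return: list of dicts with 'category' and 'product' keys
    """
    return [
        {"category": record["Category"], "product": record["Product Name"]}
        for record in records
    ]


# Usage with the preprocessed dataframe and an open session:
# rows = build_relationship_rows(product_data_df.to_dict("records"))
# session.run(RELATIONSHIP_QUERY, rows=rows)
```

The rows should come from the same preprocessed dataframe that was used to create the nodes, otherwise the names will not match.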

 
 

Visualizing The Created Nodes

Hover over the open icon and click on neo4j browser to visualize the nodes that we have created.
Neo4j browser

Neo4j browser 2

Graph view

Our data is now loaded into Neo4j along with its embeddings.
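If you prefer to check the load from Python instead of the browser, a few count queries are enough. This is a small sketch under my own naming (COUNT_QUERIES and count_loaded_items are not from the original code), and it expects an open session.

```python
# Count queries for the two node labels and the relationship created above
COUNT_QUERIES = {
    "Product_type nodes": "MATCH (n:Product_type) RETURN count(n) AS total",
    "Product_details nodes": "MATCH (n:Product_details) RETURN count(n) AS total",
    "CONTAINS relationships": (
        "MATCH (:Product_type)-[r:CONTAINS]->(:Product_details) "
        "RETURN count(r) AS total"
    ),
}


def count_loaded_items(session):
    """
    Run each count query and return the totals keyed by description.
    :param session: open neo4j session (or any object with a compatible run method)
    :return: dict - description -> count
    """
    counts = {}
    for label, query in COUNT_QUERIES.items():
        # A count query always returns exactly one record
        record = session.run(query).single()
        counts[label] = record["total"]
    return counts


# Usage:
# with GraphDatabase.driver(url, auth=auth) as driver:
#     with driver.session() as session:
#         print(count_loaded_items(session))
```

The Product_details count should match the number of rows in the CSV, and the Product_type count should match the number of distinct categories.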

 
In the upcoming blogs we'll see how to build a graph query engine using Python and use the fetched data for augmented generation.

Hope this helps... See you !!!

LinkedIn - https://www.linkedin.com/in/praveenr2998/
Github - https://github.com/praveenr2998/Creating-Lightweight-RAG-Systems-With-Graphs/tree/main/push_data_to_db
