graph-database-project

This project interacts with a graph database representing Wikipedia's Main Topic Classification categories. Its command-line utility lets you query the database efficiently.

Wikipedia Graph Database Project

Introduction

This project provides a command-line interface to interact with a graph database representing Wikipedia's Main Topic Classification categories. It allows efficient querying and exploration of the category hierarchy.

Link to Wikipedia classifications

Table of Contents

  1. Technologies Used
  2. Architecture
  3. Prerequisites
  4. Installation
  5. Setup
  6. Design and Implementation
  7. Usage
  8. Results
  9. Self-Evaluation
  10. Contributing
  11. Support
  12. License

Technologies Used

  • Neo4j (version 5.20.0)
  • Python (version 3.12.3)

Architecture

Components

  • Neo4j Database: Stores the graph data representing Wikipedia classifications.
  • Python Scripts:
    • import_data.py: Imports data from CSV files into Neo4j.
    • utils.py: Provides utility functions for database operations.
    • goals.py: Defines functions for various database queries.
    • dbcli.py: Command-line interface for interacting with the database.
  • Configuration:
    • config.py: Stores database connection details and other settings.
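As an illustration of what such a settings module might contain, here is a minimal sketch. The variable names, the environment-variable overrides, and the batch size are assumptions made for this example; the actual contents of config.py are not shown in this README.

```python
# config.py (hypothetical layout; names and values are assumptions)
import os

# Bolt is the protocol used by the Neo4j Python driver; 7687 is its default port.
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
# Left empty by default so that no credential is stored in source control.
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "")

# Import settings, matching the file name and thread count mentioned in this README.
DATA_FILE = "taxonomy_iw.csv.gz"
IMPORT_WORKERS = 4
BATCH_SIZE = 1000  # illustrative value only
```

Reading credentials from the environment is one option; editing the defaults directly, as the Setup steps describe, works as well.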

Data Flow

  1. Data Import: import_data.py processes and imports data from taxonomy_iw.csv.gz into Neo4j.
  2. Database Interaction: dbcli.py executes user commands using functions from goals.py.
  3. Utility Functions: utils.py manages database connections and query execution.
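The connection and query-execution role described for utils.py could be sketched as follows. The function names make_driver, open_session, and run_query are hypothetical and chosen for this example; only GraphDatabase.driver, driver.session, session.run, and record.data come from the neo4j Python driver.

```python
from contextlib import contextmanager


def make_driver(uri, user, password):
    """Create a driver; neo4j is imported lazily so the helpers below load without it."""
    from neo4j import GraphDatabase
    return GraphDatabase.driver(uri, auth=(user, password))


@contextmanager
def open_session(driver):
    """Open a session on an existing driver and always close it afterwards."""
    session = driver.session()
    try:
        yield session
    finally:
        session.close()


def run_query(driver, query, **params):
    """Run one Cypher query and yield each record as a plain dict."""
    with open_session(driver) as session:
        for record in session.run(query, **params):
            yield record.data()
```

Because run_query is a generator, the session stays open until the caller has consumed all records, which suits streaming output in a CLI.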

Prerequisites

  • Python 3.12.3
  • Neo4j server
  • Virtual environment
  • Required Python packages: neo4j, pandas, tqdm (optional)

Installation

Installing Neo4j (Ubuntu)

sudo apt update && sudo apt upgrade -y
wget -O - https://debian.neo4j.com/neotechnology.gpg.key | sudo apt-key add -
echo 'deb https://debian.neo4j.com stable 5' | sudo tee /etc/apt/sources.list.d/neo4j.list
sudo apt update
sudo apt install neo4j -y

Verify installation: neo4j --version

Installing Python and Setting Up Environment

sudo apt install python3 python3-venv python3-pip -y
python3 -m venv myenv
source myenv/bin/activate
pip install neo4j pandas tqdm

Setup

  1. Download taxonomy_iw.csv.gz to the project directory.
  2. Download project files: config.py, utils.py, import_data.py, goals.py, dbcli.py.
  3. Navigate to the project directory: cd dbproject
  4. Activate the virtual environment: source myenv/bin/activate
  5. Start Neo4j server:
    sudo systemctl enable neo4j
    sudo systemctl start neo4j
    
  6. Set up Neo4j browser:
    • Open http://localhost:7474 in your web browser.
    • Set a password for the default neo4j user.
  7. Update config.py with your Neo4j credentials.
  8. Import data: python import_data.py
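The import step in the list above could be sketched roughly as follows. This is not the project's import_data.py: the Category label, the parent/child row keys, and every function name are assumptions made for illustration, and reading the CSV file is left out so the example stays short.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Neo4j 5 syntax for a uniqueness constraint (label name is an assumption).
SCHEMA = (
    "CREATE CONSTRAINT category_name IF NOT EXISTS "
    "FOR (c:Category) REQUIRE c.name IS UNIQUE"
)

# One round trip per batch: UNWIND turns the parameter list into rows.
IMPORT_BATCH = """
UNWIND $rows AS row
MERGE (p:Category {name: row.parent})
MERGE (c:Category {name: row.child})
MERGE (p)-[:HAS_SUBCATEGORY]->(c)
"""


def batches(pairs, size):
    """Group (parent, child) pairs into lists of row dicts of at most `size`."""
    it = iter(pairs)
    while chunk := list(islice(it, size)):
        yield [{"parent": p, "child": c} for p, c in chunk]


def write_batch(driver, rows, retries=3):
    """Write one batch in its own session, retrying on any error.

    A real importer would retry only transient errors such as deadlocks.
    """
    for attempt in range(1, retries + 1):
        try:
            with driver.session() as session:
                session.run(IMPORT_BATCH, rows=rows).consume()
            return len(rows)
        except Exception:
            if attempt == retries:
                raise


def import_pairs(driver, pairs, batch_size=1000, workers=4):
    """Create the constraint, then write batches from a pool of threads."""
    with driver.session() as session:
        session.run(SCHEMA).consume()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda rows: write_batch(driver, rows),
                          batches(pairs, batch_size))
        return sum(counts)
```

Each worker opens its own session because the driver object can be shared between threads while sessions cannot. Concurrent MERGE calls touching the same node may conflict, which is one reason a retry loop is useful here.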

Design and Implementation

  1. Schema Design:

    • Nodes: Categories with name property
    • Relationships: HAS_SUBCATEGORY between parent-child nodes
    • Unique constraint on name property
    • Index on name property for all nodes
  2. Data Import:

    • Batch processing with multi-threading (4 cores)
    • Error handling and retries
    • Progress tracking with tqdm
  3. Query Functions:

    • Implemented in goals.py
    • Use Cypher queries with asynchronous processing for efficient graph traversal
    • Yield results for streaming
  4. Command Line Interface:

    • dbcli.py provides a user-friendly interface
    • Executes queries and streams results
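To make the "yield results for streaming" idea concrete, here is a minimal sketch of how a few query functions might look. The Cypher text, the Category label, and the function names are assumptions for illustration; the actual queries in goals.py are not reproduced in this README.

```python
CHILDREN = (
    "MATCH (:Category {name: $name})-[:HAS_SUBCATEGORY]->(c:Category) "
    "RETURN c.name AS name"
)
# *2 means exactly two HAS_SUBCATEGORY hops; DISTINCT removes nodes reached twice.
GRANDCHILDREN = (
    "MATCH (:Category {name: $name})-[:HAS_SUBCATEGORY*2]->(g:Category) "
    "RETURN DISTINCT g.name AS name"
)
PARENTS = (
    "MATCH (p:Category)-[:HAS_SUBCATEGORY]->(:Category {name: $name}) "
    "RETURN p.name AS name"
)
# A root is a category with no incoming HAS_SUBCATEGORY relationship.
ROOTS = (
    "MATCH (r:Category) WHERE NOT ()-[:HAS_SUBCATEGORY]->(r) "
    "RETURN r.name AS name"
)


def stream_names(session, query, **params):
    """Yield the `name` column one record at a time instead of building a list."""
    for record in session.run(query, **params):
        yield record["name"]


def find_children(session, name):
    yield from stream_names(session, CHILDREN, name=name)


def find_grandchildren(session, name):
    yield from stream_names(session, GRANDCHILDREN, name=name)


def find_parents(session, name):
    yield from stream_names(session, PARENTS, name=name)


def find_roots(session):
    yield from stream_names(session, ROOTS)
```

A caller such as the CLI can print each name as it arrives, so large result sets never need to fit in memory at once.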

Usage

Activate the virtual environment and run:

python dbcli.py <goal_number> [arguments]

Available goals:

  1. Find children of a node: python dbcli.py 1 <node_name>
  2. Count children of a node: python dbcli.py 2 <node_name>
  3. Find grandchildren of a node: python dbcli.py 3 <node_name>
  4. Find parents of a node: python dbcli.py 4 <node_name>
  5. Count parents of a node: python dbcli.py 5 <node_name>
  6. Find grandparents of a node: python dbcli.py 6 <node_name>
  7. Count unique nodes: python dbcli.py 7
  8. Find root nodes: python dbcli.py 8
  9. Find nodes with most children: python dbcli.py 9
  10. Find nodes with the fewest children: python dbcli.py 10
  11. Rename a node: python dbcli.py 11 <old_name> <new_name>
  12. Find paths between nodes: python dbcli.py 12 <start_node> <end_node> [search_depth]
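The argument shapes listed above can be validated with a small dispatcher. The sketch below is a hypothetical illustration built on argparse, not the project's dbcli.py; GOAL_ARITY and parse_cli are names invented for this example.

```python
import argparse

# goal number -> (minimum, maximum) count of arguments after the goal number,
# taken from the list of goals in this README.
GOAL_ARITY = {
    1: (1, 1), 2: (1, 1), 3: (1, 1), 4: (1, 1), 5: (1, 1), 6: (1, 1),
    7: (0, 0), 8: (0, 0), 9: (0, 0), 10: (0, 0),
    11: (2, 2),   # old_name, new_name
    12: (2, 3),   # start_node, end_node, optional search_depth
}


def parse_cli(argv):
    """Return (goal, args) or exit with a usage error if the shape is wrong."""
    parser = argparse.ArgumentParser(prog="dbcli.py")
    parser.add_argument("goal", type=int, choices=sorted(GOAL_ARITY))
    parser.add_argument("args", nargs="*")
    ns = parser.parse_args(argv)
    low, high = GOAL_ARITY[ns.goal]
    if not low <= len(ns.args) <= high:
        parser.error(
            f"goal {ns.goal} takes {low} to {high} argument(s), got {len(ns.args)}"
        )
    return ns.goal, ns.args
```

Called as parse_cli(sys.argv[1:]), this rejects an unknown goal number or a wrong argument count before any database connection is opened.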

Results

Detailed query results can be found in the Results folder.

Self-Evaluation

Optimization of Goal 12 (Find all paths between two given nodes)

The initial implementation had performance problems on complex queries. Two improvements were made:

  1. Search Depth Limit: Introduced a parameter to limit search depth, preventing exploration of irrelevant paths.
  2. Asynchronous Processing: Implemented parallel path-finding from child nodes of the start node to the end node.
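The two improvements above can be sketched together. This is an illustration under stated assumptions, not the project's code: paths_query and find_paths are invented names, the Category label is assumed, and fetch stands for any coroutine that runs a Cypher query and returns a list of paths (each a list of node names).

```python
import asyncio

DEFAULT_DEPTH = 10


def paths_query(max_depth=DEFAULT_DEPTH, min_depth=1):
    """Build a depth-limited path query.

    The hop bounds are written into the query text, so they are validated
    as integers first; node names are still passed as parameters.
    """
    low, high = int(min_depth), int(max_depth)
    if not 0 <= low <= high:
        raise ValueError("depth bounds must satisfy 0 <= min_depth <= max_depth")
    return (
        "MATCH p = (:Category {name: $start})"
        f"-[:HAS_SUBCATEGORY*{low}..{high}]->"
        "(:Category {name: $end}) "
        "RETURN [n IN nodes(p) | n.name] AS path"
    )


async def find_paths(fetch, start, children, end, max_depth=DEFAULT_DEPTH):
    """Search from every child of `start` concurrently, then prepend `start`.

    One hop is already spent reaching a child, so each sub-search gets
    max_depth - 1 hops; a lower bound of 0 covers the case child == end.
    """
    query = paths_query(max_depth - 1, min_depth=0)
    results = await asyncio.gather(
        *(fetch(query, start=child, end=end) for child in children)
    )
    return [[start, *path] for paths in results for path in paths]
```

The children of the start node would come from a separate lookup (goal 1), and fetch could wrap an asynchronous driver session.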

Performance comparison for the query from "Centuries" to "2020s_anime_films":

The optimized version, with a default depth of 10 and asynchronous processing, identifies relevant paths quickly, balancing search depth against run time. A custom depth can be set for more extensive searches when time is not a constraint.

Contributing

Contributions to improve this project are welcome. Please submit pull requests or open issues in the project repository.

Support

For assistance or inquiries, please open an issue in the project's issue tracker or contact [email protected].

License

This project is licensed under the MIT License. See the LICENSE file for details.
