Wikipedia Graph Database Project

Introduction

This project provides a command-line interface to interact with a graph database representing Wikipedia's Main Topic Classification categories. It allows efficient querying and exploration of the category hierarchy.

Link to Wikipedia classifications

Technologies Used
Architecture
Prerequisites
Installation
Setup
Design and Implementation
Usage
Results
Self-Evaluation
Contributing
Support
License

Technologies Used

Neo4j (version 5.20.0)
Python (version 3.12.3)

Architecture

Components

Neo4j Database: Stores the graph data representing Wikipedia classifications.
Python Scripts:
- import_data.py: Imports data from CSV files into Neo4j.
- utils.py: Provides utility functions for database operations.
- goals.py: Defines functions for various database queries.
- dbcli.py: Command-line interface for interacting with the database.
Configuration:
- config.py: Stores database connection details and other settings.

Data Flow

Data Import: import_data.py processes and imports data from taxonomy_iw.csv.gz into Neo4j.
Database Interaction: dbcli.py executes user commands using functions from goals.py.
Utility Functions: utils.py manages database connections and query execution.
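
The connection and query-execution role described for utils.py could look roughly like the sketch below. It is not the project's actual code: the function names open_driver and run_query are hypothetical, and only the driver calls (GraphDatabase.driver, session.run, Record.data) come from the official neo4j Python package.

```python
def open_driver(uri, user, password):
    """Create a Neo4j driver (hypothetical helper, not the project's API)."""
    # Imported lazily so this module can be loaded without the neo4j package.
    from neo4j import GraphDatabase
    return GraphDatabase.driver(uri, auth=(user, password))


def run_query(driver, query, parameters=None):
    """Run a Cypher query and yield each result record as a plain dict."""
    with driver.session() as session:
        for record in session.run(query, parameters or {}):
            yield record.data()
```

Because run_query is a generator, callers such as dbcli.py can print rows as they arrive instead of waiting for the full result set.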

Prerequisites

Python 3.12.3
Neo4j server
Virtual environment
Required Python packages: neo4j, pandas, tqdm (optional)

Installation

Installing Neo4j (Ubuntu)

sudo apt update && sudo apt upgrade -y
wget -O - https://debian.neo4j.com/neotechnology.gpg.key | sudo apt-key add -
echo 'deb https://debian.neo4j.com stable 5' | sudo tee /etc/apt/sources.list.d/neo4j.list
sudo apt update
sudo apt install neo4j -y

Verify installation: neo4j --version

Installing Python and Setting Up Environment

sudo apt install python3 python3-venv python3-pip -y
python3 -m venv myenv
source myenv/bin/activate
pip install neo4j pandas tqdm

Setup

Download taxonomy_iw.csv.gz to the project directory.
Download project files: config.py, utils.py, import_data.py, goals.py, dbcli.py.
Navigate to the project directory: cd dbproject
Activate the virtual environment: source myenv/bin/activate

Start Neo4j server:

sudo systemctl enable neo4j
sudo systemctl start neo4j

Set up Neo4j browser:
- Open http://localhost:7474 in your web browser.
- Set a password for the default neo4j user.
Update config.py with your Neo4j credentials.
Import data: python import_data.py
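
For the config.py step above, a minimal sketch of what the file might contain is shown here. All variable names are assumptions, since the real config.py is not reproduced in this README. Note that the driver connects over Bolt (default port 7687), not the browser port 7474.

```python
import os

# Hypothetical setting names; adjust them to match the project's config.py.
# Environment variables take precedence so the password need not be committed.
NEO4J_URI = os.environ.get("NEO4J_URI", "bolt://localhost:7687")
NEO4J_USER = os.environ.get("NEO4J_USER", "neo4j")
NEO4J_PASSWORD = os.environ.get("NEO4J_PASSWORD", "")

# Input file named in the Setup steps.
DATA_FILE = "taxonomy_iw.csv.gz"
```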

Design and Implementation

Schema Design:
- Nodes: Categories with name property
- Relationships: HAS_SUBCATEGORY, linking a parent category to each of its child categories
- Unique constraint on name property
- Index on name property for all nodes
Data Import:
- Batch processing with multi-threading (4 cores)
- Error handling and retries
- Progress tracking with tqdm
Query Functions:
- Implemented in goals.py
- Use Cypher queries with asynchronous processing for efficient graph traversal
- Yield results for streaming
Command Line Interface:
- dbcli.py provides a user-friendly interface
- Executes queries and streams results

Usage

Activate the virtual environment and run:

python dbcli.py <goal_number> [arguments]

Available goals:

Find children of a node: python dbcli.py 1 <node_name>
Count children of a node: python dbcli.py 2 <node_name>
Find grandchildren of a node: python dbcli.py 3 <node_name>
Find parents of a node: python dbcli.py 4 <node_name>
Count parents of a node: python dbcli.py 5 <node_name>
Find grandparents of a node: python dbcli.py 6 <node_name>
Count unique nodes: python dbcli.py 7
Find root nodes: python dbcli.py 8
Find nodes with most children: python dbcli.py 9
Find nodes with fewest children: python dbcli.py 10
Rename a node: python dbcli.py 11 <old_name> <new_name>
Find paths between nodes: python dbcli.py 12 <start_node> <end_node> [search_depth]
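
The argument shapes listed above suggest a simple validation step before dispatching to goals.py. The sketch below is one way to do it with argparse. The ARITY table and parse_args function are hypothetical names, and the actual dbcli.py may parse its arguments differently.

```python
import argparse

# goal number -> (required argument count, optional argument count),
# derived from the goal list above.
ARITY = {
    **{goal: (1, 0) for goal in range(1, 7)},   # goals 1-6 take <node_name>
    **{goal: (0, 0) for goal in range(7, 11)},  # goals 7-10 take no arguments
    11: (2, 0),                                 # <old_name> <new_name>
    12: (2, 1),                                 # <start_node> <end_node> [search_depth]
}


def parse_args(argv):
    """Return (goal, args) or exit with a usage error on bad input."""
    parser = argparse.ArgumentParser(prog="dbcli.py")
    parser.add_argument("goal", type=int, choices=range(1, 13))
    parser.add_argument("args", nargs="*")
    ns = parser.parse_args(argv)

    required, optional = ARITY[ns.goal]
    if not required <= len(ns.args) <= required + optional:
        parser.error(
            f"goal {ns.goal} expects {required} to {required + optional} "
            f"argument(s), got {len(ns.args)}"
        )
    return ns.goal, ns.args
```

For example, parse_args(["12", "Centuries", "2020s_anime_films"]) returns the goal number 12 and the two node names, leaving the search depth at its default.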

Results

Detailed query results can be found in the Results folder.

Self-Evaluation

Optimization of Goal 12 (Find all paths between two given nodes)

The initial implementation performed poorly on complex queries. Two improvements were made:

Search Depth Limit: Introduced a parameter to limit search depth, preventing exploration of irrelevant paths.
Asynchronous Processing: Implemented parallel path-finding from child nodes of the start node to the end node.
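
The two improvements can be illustrated on a small in-memory graph. This sketch is not the Cypher-based code in goals.py. It uses a plain adjacency dict and a thread pool as stand-ins, and the function names are hypothetical. It shows the idea only: cap the number of edges per path, and start one independent search from each child of the start node.

```python
from concurrent.futures import ThreadPoolExecutor


def find_paths(graph, start, end, max_depth):
    """Yield simple paths from start to end using at most max_depth edges."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            yield path
            continue
        if len(path) - 1 >= max_depth:
            continue  # depth limit reached; do not expand further
        for child in graph.get(node, ()):
            if child not in path:  # avoid cycles
                stack.append((child, path + [child]))


def find_paths_parallel(graph, start, end, max_depth=10, workers=4):
    """Search from each child of start in parallel, then prepend start."""
    def search(child):
        # One edge is already used to reach the child.
        return [
            [start] + path
            for path in find_paths(graph, child, end, max_depth - 1)
            if start not in path
        ]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for paths in pool.map(search, graph.get(start, ())):
            yield from paths
```

A smaller max_depth prunes long detours early, which is why the depth limit trades completeness for speed.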

For the query from "Centuries" to "2020s_anime_films", the optimized version, with a default depth of 10 and asynchronous processing, identifies the relevant paths quickly, balancing search depth against run time. A custom depth can be set for more extensive searches when time is not a constraint.