
Study project for big data (Hadoop, Zookeeper, Kafka, Flink, Spark)

Features ✨

Supported Technologies:

  • Hadoop 3.3.6 (with JDK 8.0.352-zulu, Maven 3.6.3)
  • Zookeeper 3.9.2
  • Kafka 2.12-3.7.1 (Kafka 3.7.1 built for Scala 2.12)

Installation 📦

  1. Clone the repository:
    git clone --depth=1 <repository-url> && cd mcdd-big-data-study
  1. Build the Docker image:
    cd docker
    docker build -t caobaoqi1029/big-data-study:x.x.x .

Note: Replace x.x.x with the appropriate version number.

  1. Start the containers:
    docker compose up -d
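
After `docker compose up -d` returns, the containers may still be starting. A minimal sketch of a readiness check, assuming the container is named `master` (the name used with `docker exec` in the Configuration section); `wait_for_container` is a hypothetical helper, not part of this repository:

```shell
# Poll until the named container reports State.Running=true, or give up.
# Usage: wait_for_container NAME [ATTEMPTS] [DELAY_SECONDS]
# NOTE: hypothetical helper; the default attempt count and delay are arbitrary.
wait_for_container() {
  name="$1"
  attempts="${2:-10}"
  delay="${3:-2}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # docker inspect prints "true" or "false" for this template.
    state="$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null)"
    if [ "$state" = "true" ]; then
      echo "$name is running"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$name did not start" >&2
  return 1
}
```

For example, `wait_for_container master && docker exec -it master bash` opens a shell only once the container is up.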

Configuration 🛠

  1. Connect to the remote server via VS Code and attach to a running container.
  1. Install the Java Dev extension in VS Code.
  1. Restart the extension host to apply changes.
  1. Initialize the Hadoop environment (format the NameNode only once; reformatting erases existing HDFS metadata):
    docker exec -it master bash
    hdfs namenode -format
  1. Start Hadoop services:
  1. Create an input file, upload it to HDFS, and list the root directory:
    vim input.txt
    hdfs dfs -put -f ./input.txt /
    hdfs dfs -ls /
  1. Build and run the Hadoop job:
    mvn clean package
    cd target/
    hadoop jar big-data.jar
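
The "Start Hadoop services" step lists no commands. A minimal sketch, assuming a stock Hadoop 3.x installation whose `sbin` scripts `start-dfs.sh` and `start-yarn.sh` are on `PATH` inside the `master` container; `start_hadoop` is a hypothetical helper, and the project's own startup procedure may differ:

```shell
# Start HDFS and then YARN, failing early if a launcher script is missing.
# ASSUMPTION: start-dfs.sh and start-yarn.sh (from $HADOOP_HOME/sbin) are on PATH.
start_hadoop() {
  for script in start-dfs.sh start-yarn.sh; do
    if ! command -v "$script" >/dev/null 2>&1; then
      echo "missing: $script" >&2
      return 1
    fi
  done
  # Only start YARN if HDFS started without error.
  start-dfs.sh && start-yarn.sh
}
```

Running `jps` afterwards should list the Hadoop daemons (for example NameNode and ResourceManager) if startup succeeded.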

Tip: You can set the environment variable to run Java directly:

# Add this to .bashrc for persistence.
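
The tip does not show which variable to set. One plausible reading, and it is only an assumption, is that the Hadoop jars need to be on `CLASSPATH` so that `java` can launch the job class without the `hadoop jar` wrapper. The sketch below uses the real `hadoop classpath` subcommand; the helper name `hadoop_classpath_line` is hypothetical:

```shell
# Print an export line that prepends the Hadoop jars to CLASSPATH.
# ASSUMPTION: CLASSPATH is the variable the tip refers to; the source does not say.
hadoop_classpath_line() {
  # `hadoop classpath` prints the classpath needed for the Hadoop jars.
  hcp="$(hadoop classpath)" || return 1
  # $CLASSPATH is left unexpanded so it is resolved when .bashrc is sourced.
  printf 'export CLASSPATH="%s:$CLASSPATH"\n' "$hcp"
}
```

Appending its output to `~/.bashrc` (`hadoop_classpath_line >> ~/.bashrc`) would make the setting persistent across shells.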
  1. View the output:
    hdfs dfs -ls /output
    hdfs dfs -cat /output/part-r-00000
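
A MapReduce job normally writes an empty `_SUCCESS` marker into its output directory when it finishes cleanly. A minimal sketch that checks for the marker before printing results, assuming the job writes to `/output` as in the commands above; `show_job_output` is a hypothetical helper:

```shell
# Print all reducer output files, but only if the job left a _SUCCESS marker.
# Usage: show_job_output [HDFS_OUTPUT_DIR]   (defaults to /output)
# NOTE: hypothetical helper; assumes reducer files follow the part-r-NNNNN naming.
show_job_output() {
  out_dir="${1:-/output}"
  if ! hdfs dfs -test -e "$out_dir/_SUCCESS"; then
    echo "no _SUCCESS marker in $out_dir" >&2
    return 1
  fi
  # The glob is quoted so that HDFS, not the local shell, expands it.
  hdfs dfs -cat "$out_dir/part-r-*"
}
```

If the job is rerun, the existing output directory usually has to be removed first (`hdfs dfs -rm -r /output`), because MapReduce refuses to overwrite it.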

Contributing 🤝

