
Features ✨

Supported Technologies:

Hadoop 3.3.6 (with JDK 8.0.352-zulu, Maven 3.6.3)

Zookeeper 3.9.2

Kafka 3.7.1 (Scala 2.12 build)

Installation 📦

Clone the repository:

```
git clone https://github.com/mcddhub/mcdd-big-data-study.git --depth=1 && cd mcdd-big-data-study
```

Build the Docker image:

```
cd docker
docker build -t caobaoqi1029/big-data-study:x.x.x .
```

Note: Replace x.x.x with the appropriate version number.
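
One way to avoid editing the tag by hand is to build it from a version string. The sketch below is only an illustration: the `image_tag` helper and its MAJOR.MINOR.PATCH check are assumptions, not part of the project, and `1.0.0` in the usage comment is a hypothetical version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the full image tag for a given version.
# The MAJOR.MINOR.PATCH format check is an assumption, not a project rule.
image_tag() {
  local version="$1"
  if ! printf '%s\n' "$version" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'; then
    echo "invalid version: $version" >&2
    return 1
  fi
  printf 'caobaoqi1029/big-data-study:%s\n' "$version"
}

# Usage, from the docker/ directory (1.0.0 is a placeholder version):
#   docker build -t "$(image_tag 1.0.0)" .
```

The check rejects the literal `x.x.x` placeholder, so a forgotten substitution fails before `docker build` starts.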

Start the containers:
```
docker compose up -d
```

Configuration 🛠

Connect to the remote server via VS Code and attach to a running container.

Install the Java Dev extension in VS Code.

Restart the extension host to apply changes.

Initialize Hadoop environment:

```
docker exec -it master bash
hdfs namenode -format
```
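
Formatting the NameNode wipes existing HDFS metadata, so it is normally done only once. A minimal guard might look like the sketch below. It relies on the fact that a formatted NameNode directory contains a `current/VERSION` file. The helper name `namenode_needs_format` is hypothetical, and the directory path is left as a placeholder because this README does not state where the image stores NameNode data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) only if the given NameNode
# directory has not been formatted yet, i.e. current/VERSION is missing.
namenode_needs_format() {
  local name_dir="$1"
  [ ! -f "$name_dir/current/VERSION" ]
}

# Usage inside the master container. The path is a placeholder; the real
# value is whatever dfs.namenode.name.dir is set to, which
# `hdfs getconf -confKey dfs.namenode.name.dir` prints (possibly with a
# file:// prefix that must be stripped first):
#   if namenode_needs_format /path/to/namenode/dir; then
#     hdfs namenode -format
#   fi
```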

Start Hadoop services:
```
start-all.sh
```
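
`start-all.sh` launches the HDFS and YARN daemons, and `jps` lists the Java processes running on the current node. The sketch below reports which expected daemons are absent from `jps` output. The `missing_daemons` helper is hypothetical, and which daemons belong on `master` depends on the cluster layout, which this README does not describe.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `jps` output ("<pid> <name>" per line) from
# stdin and print every expected daemon name that is not listed.
missing_daemons() {
  local jps_output name
  jps_output="$(cat)"
  for name in "$@"; do
    # Compare the second column exactly, so "NameNode" does not match
    # "SecondaryNameNode".
    if ! printf '%s\n' "$jps_output" |
        awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage inside a container (the daemon list is an assumption about the layout):
#   jps | missing_daemons NameNode ResourceManager
```

Empty output means every listed daemon was found.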

Use the following commands to interact with Hadoop:

```
vim input.txt
hdfs dfs -put -f ./input.txt /
hdfs dfs -ls /
```

Build and run the Hadoop job:

```
mvn clean package
cd target/
hadoop jar big-data.jar
```
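
MapReduce jobs that write through `FileOutputFormat` refuse to start when the output directory already exists, so a second run usually needs a cleanup step first. The sketch below assumes the job writes to `/output` (the directory inspected later in this README). The `run` and `rerun_job` helpers and the `DRY_RUN` switch are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: delete the previous output directory, then resubmit.
# -f makes the delete succeed even when the directory does not exist.
rerun_job() {
  local jar="${1:-big-data.jar}" out_dir="${2:-/output}"
  run hdfs dfs -rm -r -f "$out_dir" &&
    run hadoop jar "$jar"
}

# Usage from the target/ directory:
#   DRY_RUN=1 rerun_job    # print the commands without running them
#   rerun_job              # clean /output and run the job
```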

Tip: To run compiled classes directly with `java`, add the directory that contains them (here `/tmp/`) to `CLASSPATH`:
```
export CLASSPATH=$CLASSPATH:/tmp/
# Add this line to ~/.bashrc to make it persistent.
```

View the output:

```
hdfs dfs -ls /output
hdfs dfs -cat /output/part-r-00000
```
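
With the default text output format, each line of a `part-r-*` file is a key and a value separated by a tab. If the job emits counts, as a word-count-style job would (an assumption, since this README does not say what the job computes), the largest values can be pulled out with a small filter. The `top_n` helper below is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "key<TAB>count" lines from stdin and print the
# n lines with the highest counts (ties ordered by key).
top_n() {
  local n="${1:-10}"
  sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$n"
}

# Usage inside the master container:
#   hdfs dfs -cat /output/part-r-00000 | top_n 5
# With several reducers, read every part file instead:
#   hdfs dfs -cat '/output/part-r-*' | top_n 5
```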

Contributing 🤝

We welcome contributions! Feel free to submit a pull request. For more details, see the Contribution Guide.

License 📄

This project is licensed under the MIT License. See the LICENSE file for details.

Support 💖

If you find this project helpful, consider giving it a ⭐️ on GitHub!


Related Projects

fasttrackml

Experiment tracking server focused on speed and scalability

30 Mar 2023 95

data_lakehouse_local_stack

Data Lakehouse local stack with PySpark, Trino, and Minio. Includes an example to process Raygun ...

21 Jun 2024 0

incubator-xtable

Apache XTable (incubating) is a cross-table converter for lakehouse table formats that facilitate...

21 Jul 2023 850

spark

Apache Spark - A unified analytics engine for large-scale data processing

25 Feb 2014 38,255

analytics-zoo

Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray

05 Mar 2024 9

bigdata-playground

A complete example of a big data application using : Kubernetes (kops/aws), Apache Spark SQL/Stre...

12 Dec 2017 208

Masters-Thesis-on-Big-Data

Master's thesis on Big Data

01 Feb 2022 33

utils4s

A collection of test cases and related materials gathered while using Scala and Spark

24 Sep 2015 1,089

apache-spark-docker

Dockerizing an Apache Spark Standalone Cluster

19 Jul 2021 40

PapersLab

The project aims to automate content classification and knowledge retrieval, as well as to perfor...

02 Apr 2024 1

yelp_dataset

Sample analysis for the latest yelp dataset using spark

08 Sep 2017 7

cdap

An open source framework for building data analytic applications.

02 Aug 2014 735

maven-apache-parent

Apache Software Foundation Parent POM

04 Nov 2017 35

Dockerfiles

50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Je...

17 Jan 2016 1,280

hdfs-stream-processing

Streaming data processing using Hadoop HDFS, Spark, Kafka, Minio, Elasticsearch

21 Jul 2024 1