Twitter Analyzing

A Java example of analyzing Twitter data with Apache Hadoop MapReduce.

  • TwitterFollowerCounter - count the number of followers per user (a sketch of this job follows the list)
  • TwitterAvgFollowers - compute the average number of followers across all users
  • TwitterTopFollowers - find the top 50 users by number of followers
  • TwitterFollowerCounterGroupByRanges - count how many users fall into each follower-count range [1 .. 10], [11 .. 100], [101 .. 1000], ...
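
Each class above is run as a single MapReduce job (see the How-to run section). As a rough illustration of the first one: counting followers boils down to emitting (user_id, 1) for every user_id \t follower_id input line and summing per user. The class below is only a sketch of that idea - the names are invented and the repository's TwitterFollowerCounter may be written differently:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only; not the repository's actual TwitterFollowerCounter.
public class FollowerCountSketch {

    // Input line "user_id \t follower_id" -> emit (user_id, 1).
    public static class CountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                userId.set(parts[0]);
                context.write(userId, ONE);
            }
        }
    }

    // Sum the ones per user -> output line "user_id \t number_of_followers".
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        private final LongWritable total = new LongWritable();

        @Override
        protected void reduce(Text userId, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            total.set(sum);
            context.write(userId, total);
        }
    }
}

Since the sum is associative, the same reducer class could also be registered as a combiner to shrink the shuffle over the 24.4 G input.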

As you can see in build.log and result.log, the most followed user is 16409683 - @britneyspears (as of the 2011 dataset).
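
That ranking is what the top-50 job produces. Conceptually it can be done by sending every user_id \t number_of_followers record from the count output to a single dummy key and keeping only the 50 largest counts in one reducer (the job runs with a single reduce task). The class below only sketches that idea with invented names; the repository's TwitterTopFollowers may be organized differently:

import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.PriorityQueue;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only; not the repository's actual TwitterTopFollowers.
public class TopFollowersSketch {
    private static final int TOP_N = 50;

    // Send every "user_id \t number_of_followers" record to one dummy key (0)
    // so that a single reducer sees all candidates.
    public static class TopMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private static final IntWritable DUMMY_KEY = new IntWritable(0);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(DUMMY_KEY, line);
        }
    }

    // Keep only the 50 largest counts; emit "0 \t user_id \t number_of_followers".
    // (The job would be configured with a single reduce task.)
    public static class TopReducer
            extends Reducer<IntWritable, Text, IntWritable, Text> {
        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Min-heap ordered by follower count; entries are {user_id, count}.
            PriorityQueue<long[]> top =
                    new PriorityQueue<long[]>((a, b) -> Long.compare(a[1], b[1]));
            for (Text value : values) {
                String[] parts = value.toString().split("\t");
                if (parts.length != 2) {
                    continue;
                }
                top.offer(new long[] { Long.parseLong(parts[0]), Long.parseLong(parts[1]) });
                if (top.size() > TOP_N) {
                    top.poll();   // drop the currently smallest count
                }
            }
            // Reverse the heap order so the output is sorted by count, descending.
            Deque<long[]> ordered = new ArrayDeque<>();
            while (!top.isEmpty()) {
                ordered.push(top.poll());
            }
            for (long[] rec : ordered) {
                context.write(key, new Text(rec[0] + "\t" + rec[1]));
            }
        }
    }
}

In practice each map task would usually keep its own local top 50 and emit only those records, so the single reducer receives at most 50 records per map task instead of the full count output.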

How-to run

git clone https://github.com/pahaz/twitter-hadoop-example.git
cd twitter-hadoop-example

mvn compile
mvn package

# remove output directories left over from previous runs ("No" is printed if a directory does not exist)
hadoop fs -rm -r /user/s0073/count || echo "No"
hadoop fs -rm -r /user/s0073/avg || echo "No"
hadoop fs -rm -r /user/s0073/top || echo "No"
hadoop fs -rm -r /user/s0073/range || echo "No"

hadoop jar target/hhd-1.0-SNAPSHOT.jar TwitterFollowerCounter /data/twitter/twitter_rv.net /user/s0073/count
hadoop jar target/hhd-1.0-SNAPSHOT.jar TwitterAvgFollowers /user/s0073/count /user/s0073/avg
hadoop jar target/hhd-1.0-SNAPSHOT.jar TwitterTopFollowers /user/s0073/count /user/s0073/top
hadoop jar target/hhd-1.0-SNAPSHOT.jar TwitterFollowerCounterGroupByRanges /user/s0073/count /user/s0073/range

See build.log for the full output of these commands.

Source Data format

  • /data/twitter/twitter_rv.net contains lines of the form user_id \t follower_id (no spaces around the \t)

Generated Data formats

  • /user/s0073/count - lines of the form user_id \t number_of_followers (no spaces around the \t)
  • /user/s0073/avg - a single line 0 \t average_number_of_followers (no spaces around the \t; 0 is a dummy key - see the sketch after this list)
  • /user/s0073/top - 50 lines of the form 0 \t user_id \t number_of_followers (no spaces around the \t; 0 is a dummy key)
  • /user/s0073/range - lines of the form range_number \t number_of_users (no spaces around the \t; range_number example: 10 - [1 .. 10], 11 - [11 .. 100], ...)
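
The avg format above (one 0 \t average line) hints at how the averaging job can work: send every per-user count to the dummy key 0 and let a single reducer divide the total by the number of users. Again, the class below is only a sketch with invented names, not necessarily how TwitterAvgFollowers is written:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative sketch only; not the repository's actual TwitterAvgFollowers.
public class AvgFollowersSketch {

    // Input line "user_id \t number_of_followers" -> emit (0, number_of_followers);
    // the constant key routes every record into one reducer group.
    public static class AvgMapper
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
        private static final IntWritable DUMMY_KEY = new IntWritable(0);
        private final LongWritable followers = new LongWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            if (parts.length == 2) {
                followers.set(Long.parseLong(parts[1]));
                context.write(DUMMY_KEY, followers);
            }
        }
    }

    // Average all per-user counts -> single output line "0 \t average_number_of_followers".
    public static class AvgReducer
            extends Reducer<IntWritable, LongWritable, IntWritable, DoubleWritable> {
        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long users = 0;
            for (LongWritable c : counts) {
                sum += c.get();
                users++;
            }
            if (users > 0) {
                context.write(key, new DoubleWritable((double) sum / users));
            }
        }
    }
}

Note that this reducer cannot simply be reused as a combiner: an average of partial averages is wrong unless the partial record counts are carried along as well.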

How-to view results

# the count output is large (13 part files), so it is left commented out:
# hadoop fs -cat /user/s0073/count/*
hadoop fs -cat /user/s0073/avg/*
hadoop fs -cat /user/s0073/top/*
hadoop fs -cat /user/s0073/range/*

See result.log for these outputs.

Time

The Hadoop cluster ran on 8 nodes, each with 2x dual-core AMD Opteron 285 (2.6 GHz), 8 GB RAM, and a 150 GB HDD.

Files:

  • /data/twitter/twitter_rv.net - 24.4 G
  • /user/s0073/count - 35.6 M * 13 (parts 00000 - 00012)

Time:

  • TwitterFollowerCounter - 22:32:52 - 23:02:35 (0:29:43)
  • TwitterAvgFollowers - 23:02:43 - 23:04:24 (0:01:41)
  • TwitterTopFollowers - 23:04:31 - 23:06:50 (0:02:19)
  • TwitterFollowerCounterGroupByRanges - 23:06:57 - 23:08:06 (0:01:09)

Other

Sometimes you need to raise a Hadoop job's priority. Use this command:

mapred job -set-priority <job_id> VERY_HIGH