Hadoop MapReduce Join Quiz!!

미니~ 2016. 1. 20. 07:15

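This quiz comes from a lecture by Paul Rodriguez of the San Diego Supercomputer Center at the University of California, San Diego.

I do not plan to post the answer to this quiz, so follow the steps below carefully and solve it yourself.

The earlier post "Hadoop MapReduce Join Example" should make the implementation straightforward.

Following the example below, generate the data files and then implement, in Python, a MapReduce job that joins them.

1. Save the following Python source, which generates the data files used in the quiz, as make_join2data.py.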

#!/usr/bin/env python
import sys

# --------------------------------------------------------------------------
#  (make_join2data.py) Generate a random combination of titles and viewer
#  counts, or titles and channels.
#  This is a simple version of a congruential generator:
#  not a great random generator, but good enough here.
#
#  Usage: python make_join2data.py <mode> <num_lines> <seed>
#    mode 'n'        -> print <show>,<channel>
#    any other mode  -> print <show>,<viewer count>
# --------------------------------------------------------------------------

chans = ['ABC', 'DEF', 'CNO', 'NOX', 'YES', 'CAB', 'BAT', 'MAN', 'ZOO', 'XYZ', 'BOB']
sh1 = ['Hot', 'Almost', 'Hourly', 'PostModern', 'Baked', 'Dumb', 'Cold', 'Surreal', 'Loud']
sh2 = ['News', 'Show', 'Cooking', 'Sports', 'Games', 'Talking', 'Talking']
vwr = range(17, 1053)

chvnm = sys.argv[1]  # mode argument: 'n' means "no numbers", so print channels

lch = len(chans)
lsh1 = len(sh1)
lsh2 = len(sh2)
lvwr = len(vwr)

# starting indices into the lists above
ci = 1
s1 = 2
s2 = 3
vwi = 4
ri = int(sys.argv[3])  # arg 3 is the seed added at every step

for i in range(0, int(sys.argv[2])):  # arg 2 is the number of lines to output
    if chvnm == 'n':  # no numbers: print <show>,<channel>
        print('{0}_{1},{2}'.format(sh1[s1], sh2[s2], chans[ci]))
    else:             # print <show>,<viewer count>
        print('{0}_{1},{2}'.format(sh1[s1], sh2[s2], vwr[vwi]))

    # advance every index with a simple congruential step
    ci = (5 * ci + ri) % lch
    s1 = (4 * s1 + ri) % lsh1
    s2 = (3 * s1 + ri + i) % lsh2
    vwi = (2 * vwi + ri + i) % lvwr

    if vwi == 4:
        vwi = 5

2. Create a script named make_data_join2.txt that runs the Python source, then execute it.

> sh make_data_join2.txt
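
The script only has to call make_join2data.py once per output file and redirect the output. As a rough Python equivalent (a sketch, not the course's script), the step might look like the following. The line counts, the seeds, the 'y' mode letter, and the A/B/C file suffixes are assumed values chosen for illustration; only the join2_gennum*.txt / join2_genchan*.txt name patterns and the argument order (mode, number of lines, seed) come from the text above.

```python
#!/usr/bin/env python
# Hypothetical Python stand-in for the make_data_join2.txt shell script.
import os
import subprocess
import sys

GENERATOR = "make_join2data.py"

# (mode, number of lines, seed, output file)
# mode 'n' -> <show>,<channel>; any other mode (here 'y') -> <show>,<count>.
# Line counts, seeds, and the A/B/C suffixes are ASSUMED, not from the course.
JOBS = [
    ("y", 1000, 13, "join2_gennumA.txt"),
    ("y", 2000, 17, "join2_gennumB.txt"),
    ("y", 3000, 19, "join2_gennumC.txt"),
    ("n", 100, 23, "join2_genchanA.txt"),
    ("n", 200, 19, "join2_genchanB.txt"),
    ("n", 300, 37, "join2_genchanC.txt"),
]


def build_command(mode, count, seed):
    """Return the argv list that runs the generator once."""
    return [sys.executable, GENERATOR, mode, str(count), str(seed)]


def main():
    if not os.path.exists(GENERATOR):
        sys.stderr.write("{0} not found in the current directory\n".format(GENERATOR))
        return
    for mode, count, seed, out_name in JOBS:
        # Redirect the generator's standard output into the data file.
        with open(out_name, "w") as out_file:
            subprocess.check_call(build_command(mode, count, seed), stdout=out_file)


if __name__ == "__main__":
    main()
```

Different seeds give different data, so totals computed from these files will only match someone else's results if the same arguments are used.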

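3. Take a look at the generated data files.

The join2_gennum*.txt files hold <TV show, count> pairs: a TV show and a viewer count, one comma-separated pair per line (for example, Hourly_Sports,21).

The join2_genchan*.txt files hold <TV show, channel> pairs: a TV show and the channel it airs on (for example, Hourly_Sports,DEF).

4. Upload the generated files to HDFS.

5. Now implement a MapReduce job that answers the following question.

"What is the total number of viewers for the shows that air on the ABC channel?"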

Written in SQL-like form, the task looks like this.

SELECT SUM(viewer_count) FROM FileA, FileB WHERE FileA.tv_show = FileB.tv_show AND FileB.channel = 'ABC' GROUP BY FileA.tv_show
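
To make the query concrete, here is a small in-memory sketch on made-up rows. It is not a MapReduce job and not the quiz answer; the function name abc_totals and the sample rows are hypothetical. It treats a show as qualifying if it appears at least once with channel ABC, which is how the hints further down read the task, rather than multiplying rows the way a strict SQL join would.

```python
from collections import defaultdict


def abc_totals(channel_rows, count_rows, channel="ABC"):
    """Sum viewer counts per show, keeping only shows seen on `channel`."""
    # Shows that appear at least once with the wanted channel (File B side).
    shows_on_channel = set(show for show, chan in channel_rows if chan == channel)

    totals = defaultdict(int)
    for show, count in count_rows:  # File A side
        if show in shows_on_channel:
            totals[show] += count
    return dict(totals)


# Made-up sample rows, not generator output.
channel_rows = [("Hot_News", "ABC"), ("Hot_News", "XYZ"), ("Cold_Show", "DEF")]
count_rows = [("Hot_News", 100), ("Hot_News", 25), ("Cold_Show", 40)]

print(abc_totals(channel_rows, count_rows))  # {'Hot_News': 125}
```

A function like this is also handy for cross-checking the MapReduce output on a small slice of the generated files.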

The final output should contain one line per TV show that airs on ABC, giving the show and its total viewer count.

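Hints

First try to implement it yourself with the help of the "Hadoop MapReduce Join Example" post. If that does not work out, use the hints below.

Implementing join2_mapper.py

Read lines and split each line into a key and a value. If the value is ABC, or if the value is a number, print the pair.

You can check the mapper output first with the following command.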

> cat join2_gen*.txt | ./join2_mapper.py | sort
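
If you get stuck after trying it yourself, the mapper hint could be read roughly as follows. This is one possible sketch, not a reference answer from the course. The helper name map_lines, the tab-separated output, and the choice to silently skip malformed lines are assumptions.

```python
#!/usr/bin/env python
# join2_mapper.py (sketch): keep only the rows relevant to the ABC question.
import sys


def map_lines(lines, channel="ABC"):
    """Yield 'show<TAB>value' for rows whose value is `channel` or a number."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # assumption: skip blank or malformed lines
        show, value = parts
        if value == channel or value.isdigit():
            yield "{0}\t{1}".format(show, value)


# Read standard input only when something is actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    for record in map_lines(sys.stdin):
        print(record)
```

Rows for other channels are dropped here, so the reducer only ever sees viewer counts and ABC markers.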

Implementing join2_reducer.py

Read lines and split each line into a key and a value. If the key has changed (and this is not the first input line), check whether ABC was found for the previous key and, if so, print that key and its running total. If the value is ABC, set a variable to mark that ABC was found (such as abc_found = True); otherwise add the value to a running total of viewer counts.

You can check the combined result first with the following command.

> cat join2_gen*.txt | ./join2_mapper.py | sort | ./join2_reducer.py
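
Again as a fallback after your own attempt, the reducer hint might translate into something like the sketch below. It is not the course's reference answer. It assumes the input is sorted by key and tab-separated (as a mapper writing 'show<TAB>value' would produce), and the helper name reduce_lines and the tab-separated output are assumptions. The hint does not spell out resetting the state on a key change or flushing the last key after the loop, but both are needed for correct totals.

```python
#!/usr/bin/env python
# join2_reducer.py (sketch): total viewer counts for shows that air on ABC.
import sys


def reduce_lines(lines, channel="ABC"):
    """Yield 'show<TAB>total' for every key that had a `channel` marker."""
    current_show = None
    abc_found = False
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        show, value = line.split("\t", 1)
        if show != current_show:
            # Key changed: emit the previous show if it airs on the channel.
            if current_show is not None and abc_found:
                yield "{0}\t{1}".format(current_show, total)
            current_show, abc_found, total = show, False, 0
        if value == channel:
            abc_found = True       # marker row from the channel files
        else:
            total += int(value)    # viewer count row
    # Flush the last key, which never sees a following key change.
    if current_show is not None and abc_found:
        yield "{0}\t{1}".format(current_show, total)


# Read standard input only when something is actually piped in.
if __name__ == "__main__" and not sys.stdin.isatty():
    for record in reduce_lines(sys.stdin):
        print(record)
```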

The final Hadoop Streaming command takes the following form.

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /user/cloudera/input/join2*.txt \
    -output /user/cloudera/output_join2 \
    -mapper /home/cloudera/join2_mapper.py \
    -reducer /home/cloudera/join2_reducer.py
