Hadoop MapReduce job on a file that contains HTML tags

https://stackoverflow.com/questions/1842747

12-09-2019

Question

I have a lot of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and my reducer in Python and used Hadoop Streaming to run them.

Here is my mapper:

#!/usr/bin/env python

import sys
import re
import string

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.

    '''
    global flag
    in_text=in_text.lstrip()
    in_text=in_text.rstrip()
    in_text=in_text+"\n"

    if flag==True: 
        in_text="<"+in_text
        flag=False
    if re.search('^<',in_text)!=None and re.search('(>\n+)$', in_text)==None: 
        in_text=in_text+">"
        flag=True
    p = re.compile(r'<[^<]*?>')
    in_text=p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input)
global flag
flag=False
for line in sys.stdin:
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower()
    line = remove_html_tags(line)
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
       # write the results to STDOUT (standard output);
       # what we output here will be the input for the
       # Reduce step, i.e. the input for reducer.py
       #
       # tab-delimited; the trivial word count is 1
       if word =='': continue
       for c in string.punctuation:
           word= word.replace(c,'')

       print '%s\t%s' % (word, 1)

Here is my reducer:

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(),
                           key=lambda(k,v):(v,k), reverse=True)

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print '%s\t%s'% (word, count)

Whenever I pipe in a small sample string such as 'hello world hello hello world ...', I get the correct output: a ranked list. However, when I use a small HTML file and pipe it into my mapper with cat, I get the following error (input2 contains some HTML code):

rohanbk@hadoop:~$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last):
  File "/home/rohanbk/reducer.py", line 15, in <module>
    word, count = line.split('\t', 1)
ValueError: need more than 1 value to unpack

Can someone explain why I am getting this? Also, what is a good way to debug a MapReduce job?

Solution

You can reproduce the bug with just:

echo "hello - world" | ./mapper.py  | sort | ./reducer.py
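To see what the reducer actually receives for this input, the mapper's per-word logic can be replayed on a single line. This is only a sketch: emit_pairs is a hypothetical helper name that does not appear in the original scripts, the HTML stripping step is left out because this input contains no tags, and the records are collected in a list instead of being printed one by one.

```python
import string

def emit_pairs(line):
    """Replay the original mapper's per-word logic on one line of text."""
    pairs = []
    for word in line.strip().lower().split():
        if word == '':
            continue  # same position as in the original mapper
        for c in string.punctuation:
            word = word.replace(c, '')
        pairs.append('%s\t%s' % (word, 1))
    return pairs

pairs = emit_pairs("hello - world")
print(pairs)

# Apply the same parsing the reducer uses to each emitted record.
for raw in pairs:
    fields = raw.strip().split('\t', 1)
    print(repr(raw), '->', fields)
```

The record to look at is the second one, produced from the lone "-" token.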

The problem is here:

if word =='': continue
for c in string.punctuation:
    word= word.replace(c,'')

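If word is a single punctuation mark, as is the case for the input above (after it is split), then it is converted to an empty string after the emptiness check has already passed. The mapper therefore prints a line consisting of a tab followed by 1; the reducer's line.strip() removes that leading tab, so line.split('\t', 1) returns a single field and the unpacking fails. So just move the check for an empty string to after the replacement.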
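The reordering described above might look like the following. It is a sketch, not the original poster's final code: clean_word and map_line are hypothetical helper names, the HTML stripping step is omitted for brevity, and print is called with a single argument so that the snippet runs under both Python 2 and Python 3.

```python
import string

def clean_word(word):
    """Strip every punctuation character from a word."""
    for c in string.punctuation:
        word = word.replace(c, '')
    return word

def map_line(line):
    """Return the tab-delimited 'word<TAB>1' records for one input line."""
    records = []
    for word in line.strip().lower().split():
        word = clean_word(word)
        # Check for emptiness AFTER removing punctuation, so that tokens
        # such as '-' never reach the reducer as an empty key.
        if word == '':
            continue
        records.append('%s\t%s' % (word, 1))
    return records

# Demonstration on the input that triggered the bug.
for record in map_line("hello - world"):
    print(record)
```

In the real mapper the outer loop would iterate over sys.stdin instead of a literal string. With this ordering the sample input yields records only for hello and world.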

Licensed under: CC-BY-SA with attribution
