bash recursive CSV parser [closed]

https://stackoverflow.com/questions/10569217

08-06-2021
|

문제

I have moved this question to code review

I've written a recursive descent parser in bash.

I'm wondering if you can point out something helpful to me. It supports backslash escapes and quoted fields with parse error reporting.

The script works like cut in some ways.. taking input from file or stdin allowing the user to select which line and which field they would like printed out. using options -l and -f respectively. A list of fields can be printed and a custom delimiter for that list can be specified using options --list '1 2 3 4 5' and --list-seperator $'\n' for example.

#!/bin/bash

shopt -s extglob; # we need maxiumum expression capabilities enabled

# option variables
declare list='' delim=$'\n' field='' lineNum=1;

while [[ ${1:0:1} = '-' ]]; do # Parse user arguments

    case $1 in
        -f)
            field=$2; shift 2;
        ;;
        -l)
            lineNum=$2; shift 2;
        ;;
        --list-seperator)
            delim="$2"; shift 2;
        ;;
        --list)
            list="$2"; shift 2;
        ;;
        *) break;;
    esac

done

# open a user supplied file on stdin
[[ -e "$1" ]] && exec 0<$1;

# data from 'read/getline'
declare input='';

# we are using sed to optimize input, the command just prints the desired line
read -r input < <(sed -n ${lineNum}p) 
# why doesn't the above work as a pipe into read?

# range of this line
declare strLen=${#input} value='';

# data processing variables
declare symbol='' value='';

# REGEX symbol "classes"
declare nothing='' comma='[,]' quote='["]' backslash='[\]' text='[^,\"]';

# integers:
declare -i iPos=-1 tPos=0;

# output array:
declare -a items=();

NextSymbol() {

    symbol="${input:$((++iPos)):1}"; # get next char from string

    (( iPos < strLen )); # return false if we are out of range

}

Accept() {

    [[ -z "$symbol" && -z "$1" ]] && return 0; # accept "nothing/empty"

    # if you can meld the above line into the next line
    # let me know: pc.wiz.tt@gmail.com; this is some kind of bug!
    # becare careful because expect expects 'nothing' to be empty.
    # that's why it says 'end of input'

    [[ "$symbol" =~ ^$1$ ]] && NextSymbol && return 0; # match symbol
}

Expect() {
    Accept "$1" && return
    local msg="$0: parse failure: line $lineNum: expected symbol: "
    echo "$msg'${1:-end of input}'" >&2;
    echo "$0: found: '$symbol'" >&2;
    exit 1;
}

value() {

    while Accept $text; do # symbol will not be what we expect here
        value+=${input:$((iPos-1)):1}; # so get what we expect
    done

    Accept $nothing && { # this will only happen at end of the string
        value+=${input:$((iPos-1)):1} # get the last char
        pushValue; # push the data into the array
    }

}

pushValue() {
    items[tPos++]=$value;
    value=''; # clear value!
}

quote() {

    until [[ $symbol =~ $quote || -z $symbol ]]; do
        value+=$symbol;
        NextSymbol;
    done

    Expect $quote;

}

line() {

    Accept $quote && {
        quote
        line;
    }

    Accept $backslash && {
        value+=$symbol;
        NextSymbol;
        value;
        line;
    }

    Accept $comma && {
        pushValue;
        line;
    }

    Accept $text && {
        value=${input:$((iPos-1)):1};
        value;
        line;
    }

}

NextSymbol;
    line;
        Expect $nothing


[[ $field ]] && { # want field    
    echo "${items[field-1]}" # print the item
                             # (index justified for human use)
    exit;
}

[[ $list ]] && { # want list
    for index in $list; do
        echo -n "${items[index-1]}${delim}" # print the list 
                                            # (index justified for human use)
    done
    exit;
}

exit 1; # no field or list

해결책

Seriously, the way to make it faster is to write in in a language that isn't interpreted.

A shell may be a poor choice for a parser for the same reason I wouldn't write an accounting package in 6502 assembly language, or an operating system in COBOL.

And, honestly, awk isn't going to be that much better. It's ideal for sequential text processing but it would be a stretch to apply it to something as complex as a parser.

Sometimes, you just have to rethink the tools you're using. If you can't do it is a compiled language like C, at least give some thought to a byte-code oriented language like Python.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow