How do I parse basic arithmetic (eg “5+5”) using a simple recursive descent parser in C++?

https://stackoverflow.com/questions/10378903

04-06-2021
|

문제

This has been on my mind for a while. I'm intrigued by recursive descent parsers, and would like to know how to implement one. What I want is a simple parser that will understand simple arithmetic such as "5+5", or "(5+5)*3".

I figure the first step is to write a 'tokenizer', which takes the entire input string and breaks it into many substrings. This part I have done (I've even had to ask about it here. You don't have to follow the link if you don't want to, since I'm posting the relavent code here as well.) With this tokenizer of mine, I end up with a vector of strings, or tokens. Now, the hard part: I'd like to parse those tokens.

I've read the Wikipedia article on recursive descent parsers. I do understand the overall concept, but as always, the implementation is a bit confusing. In that article, there is a C implementation of a recursive descent parser for a very simple programming language, also discussed in the article. I studied that code as best I could, and tried to basically write the same thing, but for my program. Below is that code.

What I am really confused about is what this parser does. It seems to go through the program and 'expect' certain parts of the grammar. But once it gets there, what does it do? For example, here is one function from the Wikipedia code that is supposed to parse a 'term':

void term(void) {
    factor();
    while (sym == times || sym == slash) {
        getsym();
        factor();
    }
}

This is meant to parse this grammar:

term = factor {("*"|"/") factor} .

which makes sense. But what does it do with the actual term? Let's say that term is simply "6", or was "3*2" and came out to be of value 6. How would it incorporate that into the rest of the input? Shouldn't term() return a double instead of void (to return the 6)? Or is it done some other way?

Also, what would be the difference between getting a parser like this to output code, and to immediately act upon the input (ie compiler vs. interpreter)? Are those two (in this example at least) theoretically implemented the same way, or are they fundamentally different?

Any input is welcome. Here is my code thus far:

#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>
#include <sstream>

using namespace std;

vector<string> symbolize(string);
bool accept(string);
void getSymbol();
void error(string s);
bool expect(string);
void expression();
void term();
void factor();

int currentPosition = -1;
string symbol;
vector<string> symbols;

int main(int argc, const char * argv[])
{

    string input;
    getline(cin,input);

    symbols = symbolize(input);
    getSymbol();
    expression();


    return 0;
}

void factor(){
    if(isdigit(symbol.c_str()[0])){}
    else if(accept("(")){
        expression();
        expect(")");
    }
    else {
        error("Syntax error");
    }

}

void term(){
    factor();
    while(symbol=="*"||symbol=="/"){
        getSymbol();
        factor();
    }
}

void expression(){
    if(symbol == "+" || symbol == "-") getSymbol();
    term();
    while(symbol == "+" || symbol == "-"){
        getSymbol();
        term();
    }
}

void error(string s){
    cout << endl << "ERROR: " << s << endl;
}

void getSymbol(){
    currentPosition++;
    if(currentPosition>=symbols.size())error("Unexpectedly reached end of input");

}

bool expect(string s){
    if(accept(s))return true;
    else error("Expected '" + s + "'");
    return false;
}

bool accept(string s){
    if(s==symbol){getSymbol();return true;}
    return false;
}

// Takes a string and breaks it into substrings
vector<string> symbolize(string input){
    int position = 0;
    char c;
    //stringstream s;
    vector<string> symbols;
    enum symbolType {TEXT,OPERATOR}symbolType,charType;

    while(position < input.size()){
        stringstream s;
        c = input.at(position);
        if(isalnum(c))symbolType = TEXT;
        else symbolType = OPERATOR;
        charType = symbolType;

        while(symbolType == charType){
            s << c;
            position++;
            if(position>=input.length())break;
            c = input.at(position);
            if(isspace(c)||c=='\n'){position++; break;}
            if(isalnum(c)) charType = TEXT;
            else charType = OPERATOR;
        }

        symbols.push_back(s.str());
    }

    return symbols;
}

Edit: I should mention that my code always prints: ERROR: syntax error, from the factor() function.

해결책

The wikipedia article contains a very complete looking parser (but no lexer!) that does nothing else.

For actually interpreting the result, the general idea is, each parsing function passes back the partly-interpreted result to its parent/caller. The result could be of a different type for each rule in the tree. If you are writing a parser for a complete language, a partially interpreted result could be a simple constant (which could be of any type) or it could be an entire function (which you may need to later compile). In the case of an equation parser, each rule would simply perform the required operation on the elements it gets from calling other functions, and pass the result back up to the function that called it.

There are two approaches that come to mind:

Have each function accept a something* result parameter. In the case of a simple equation parser this would probably be float* result for all of the elements.
Simply return the result, by changing all the functions from void rule_x()... to float rule_x()....

In either case you will need some way to deal with errors. If you are in C you have no exceptions so you would probably be best using option 1 and using the return value to indicate success. Then there would be lots of

if(!expression(&result)) return 0;

But in C++ you could wrap the parse in an exception handler, and throw an exception on error that aborted the rest of the parse.

Things get much more interesting when you want to, say compile an entire language with optimisation or JIT, and attempt to recover gracefully from syntax errors and keep parsing.

The book to get on the subject is the dragon book.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow