Sep 30, 2016 - 19 min - Uploaded by Andy Guna. An example code on converting DFA to C code.

Name: NFA to DFA Conversion. Description: A program for NFA (Non-deterministic Finite Automata) to DFA (Deterministic Finite Automata) conversion using the Subset Construction Algorithm. By: Ritin (from psc cd). Inputs: NFA states, inputs, transitions. Returns: DFA transition table.
I needed a C++ implementation of NFA to DFA conversion for my compilers class and could not find a simple implementation on the web, so I thought I would provide one. A DFA (deterministic finite automaton) is a finite state machine where, from each state and a given input symbol, the next possible state is uniquely determined. On the other hand, an NFA (nondeterministic finite automaton) can move to several possible next states from a given state and a given input symbol. However, this does not add any more power to the machine.
It still accepts the same set of languages, namely the regular languages. It is possible to convert an NFA to an equivalent DFA using the subset construction. The intuition behind this scheme is that an NFA can be in several possible states at any time. We can simulate it with a DFA whose states correspond to sets of states of the underlying NFA.
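The idea of DFA states standing for sets of NFA states can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the post: the `Nfa` and `Dfa` structs, the helper names `closure` and `subsetConstruction`, and the choice of -1 for "no move" are all made up here. Symbol 0 is treated as epsilon, matching the file format described below.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Hypothetical NFA layout for this sketch (not the post's actual classes):
// delta[{state, symbol}] is the set of next states; symbol 0 means epsilon.
struct Nfa {
    int numSymbols = 0;  // input symbols are 1..numSymbols
    std::set<int> finals;
    std::map<std::pair<int, int>, std::set<int>> delta;
};

struct Dfa {
    std::vector<std::vector<int>> next;  // next[state][symbol]; -1 means no move
    std::set<int> finals;
};

// Expand a set of NFA states with everything reachable through epsilon moves.
std::set<int> closure(const Nfa& nfa, std::set<int> states) {
    std::vector<int> work(states.begin(), states.end());
    while (!work.empty()) {
        int s = work.back();
        work.pop_back();
        auto it = nfa.delta.find({s, 0});
        if (it == nfa.delta.end()) continue;
        for (int t : it->second)
            if (states.insert(t).second) work.push_back(t);
    }
    return states;
}

// Subset construction: every DFA state corresponds to one set of NFA states.
// DFA state 0 is the closure of the NFA start state.
Dfa subsetConstruction(const Nfa& nfa, int start) {
    assert(start >= 0);
    Dfa dfa;
    std::map<std::set<int>, int> id;  // NFA state set -> DFA state number
    std::queue<std::set<int>> pending;

    // Return the DFA state for a set, creating it on first sight.
    auto intern = [&](const std::set<int>& s) {
        auto found = id.find(s);
        if (found != id.end()) return found->second;
        int fresh = static_cast<int>(id.size());
        id.emplace(s, fresh);
        dfa.next.emplace_back(nfa.numSymbols + 1, -1);
        for (int q : s)
            if (nfa.finals.count(q)) { dfa.finals.insert(fresh); break; }
        pending.push(s);
        return fresh;
    };

    intern(closure(nfa, {start}));
    while (!pending.empty()) {
        std::set<int> current = pending.front();
        pending.pop();
        int from = id[current];
        for (int a = 1; a <= nfa.numSymbols; ++a) {
            std::set<int> moved;
            for (int q : current) {
                auto it = nfa.delta.find({q, a});
                if (it != nfa.delta.end())
                    moved.insert(it->second.begin(), it->second.end());
            }
            if (moved.empty()) continue;  // no transition: leave -1
            int to = intern(closure(nfa, moved));
            dfa.next[from][a] = to;
        }
    }
    return dfa;
}
```

A DFA state is marked final as soon as its set contains any final NFA state. Sets with no outgoing move on a symbol are left as -1 rather than routed to an explicit dead state, which is a simplification chosen for this sketch.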
Take a look at the code. Note however that it is not designed for performance. It is my first attempt at a simple, readable and easy-to-understand implementation and I hope I succeeded in that regard. Here’s the format of NFA.txt:

    N M
    F a1 a2 ... af
    T
    s1 y1 T1 t1 t2 ... tT1
    s2 y2 T2 t1 t2 ... tT2
    :

The first line contains two integers N and M, the number of states and the number of input symbols respectively. The states are implicitly 0, 1, ..., N-1 and the input symbols are 1, 2, ..., M. The integer 0 is used to represent epsilon. The second line starts with an integer F, the number of final states in the NFA, followed by F integers which represent the final states. The third line contains an integer T, the number of transitions, which is also the number of lines following this one in NFA.txt. T lines follow. Each line represents a transition and starts with three integers: the previous state si, the input symbol yi and the number of states Ti the NFA can go to from the previous state on that input symbol. Ti integers follow, representing the next states the NFA can go to from the previous state si on the input symbol yi.
For example, the NFA in example 1 on this site is represented as:

    4 2
    2 0 1
    4
    0 1 2 1 2
    1 1 2 1 2
    2 2 2 1 3
    3 1 2 1 2

The only critical thing is that the symbols a and b have been relabeled as 1 and 2 respectively. Hope this helps.
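Reading this format could look roughly like the sketch below. It is not the reader from the post; the `NfaFile` struct and the function names `readNfa` and `parseNfa` are hypothetical, and no range checking of states or symbols is attempted.

```cpp
#include <istream>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical in-memory form of NFA.txt (names invented for this sketch).
struct NfaFile {
    int numStates = 0;   // N: states are 0..N-1
    int numSymbols = 0;  // M: symbols are 1..M, 0 is epsilon
    std::set<int> finals;
    std::map<std::pair<int, int>, std::set<int>> delta;  // {state, symbol} -> next states
};

// Reads the format described above; returns false if the input ends early
// or contains something that is not an integer.
bool readNfa(std::istream& in, NfaFile& nfa) {
    int finalCount = 0;
    int transitionLines = 0;
    if (!(in >> nfa.numStates >> nfa.numSymbols >> finalCount)) return false;
    for (int i = 0; i < finalCount; ++i) {
        int a;
        if (!(in >> a)) return false;
        nfa.finals.insert(a);
    }
    if (!(in >> transitionLines)) return false;
    for (int i = 0; i < transitionLines; ++i) {
        int from, symbol, count;
        if (!(in >> from >> symbol >> count)) return false;
        for (int j = 0; j < count; ++j) {
            int to;
            if (!(in >> to)) return false;
            nfa.delta[{from, symbol}].insert(to);
        }
    }
    return true;
}

// Convenience wrapper for trying the reader on a string instead of a file.
bool parseNfa(const std::string& text, NfaFile& nfa) {
    std::istringstream in(text);
    return readNfa(in, nfa);
}
```

Because the reader only relies on whitespace-separated integers, it does not care whether the numbers are split across lines exactly as shown in the example.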
“DFA.txt” will have the following format:

    N M
    F a1 a2 ... af
    s1 y1 t1
    s2 y2 t2
    :

The first line contains two integers N and M, the number of states in the equivalent DFA and the number of moves (alphabet size). M will have the same value as that for the underlying NFA, except that 0 (epsilon moves) won’t be used, so the available moves will be 1, 2, ..., M. The second line starts with an integer F, the number of final states in the DFA, followed by the F final states a1, a2, ..., af.

It’s a console program. There’s nothing to take screenshots of.
It reads in the representation of an NFA from a file “NFA.txt” and writes out the representation of the corresponding DFA to a file “DFA.txt”. Nothing appears on the screen as such. You need to write the representation of an NFA in a file “NFA.txt” stored in the same folder as the program executable. For an example, please take a look at my reply to the comment by “Pragwal G”. I made a design decision to use numbers to represent states and input symbols. It made indexing into arrays easier. Anyway, it’s very easy to modify the code to accept letters to represent states and input symbols.
You can easily create a mapping a->1, b->2, and so on by adding (1-‘a’) to a letter. The reverse mapping can be created by adding (‘a’-1) to a number.
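As a small sketch of that mapping (the function names are invented here, and lowercase ASCII letters are assumed):

```cpp
#include <string>
#include <vector>

// a -> 1, b -> 2, ...: add (1 - 'a') to the letter.
int symbolFromLetter(char letter) {
    return letter + (1 - 'a');
}

// 1 -> a, 2 -> b, ...: add ('a' - 1) to the number.
char letterFromSymbol(int symbol) {
    return static_cast<char>(symbol + ('a' - 1));
}

// Turn a word such as "abba" into the integer symbols the program expects.
std::vector<int> encodeWord(const std::string& word) {
    std::vector<int> symbols;
    for (char c : word) symbols.push_back(symbolFromLetter(c));
    return symbols;
}
```

With this, a test word like abba becomes the input 1 2 2 1.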
If you want to submit an NFA with letters representing states/symbols, you can map letters to numbers while reading the file. Similarly, if you want to receive a DFA with letters representing states/symbols, you can map numbers to letters while writing the file. Reply if you encounter any problem.

Hey, thank you very much for the quick reply. I have to submit this homework at midnight and now it is 5 pm. Thank you for the code. I used your code for reading the NFA file, converting it to a DFA, and then used the other class DFA to display the outputs “Accept or Reject”. Everything works fine. But I don't know how to test it. For example, I want to give the representation of an NFA which contains empty transitions, and then test the program on different inputs. Sorry for asking you again: to test the program with letters, e.g. abb, abba, aabbb etc., what piece of code should I change? Or do I just need to change the NFA file? I'm under pressure from the homework, and I also have to report on it and explain the flow of execution. Thank you again for the code, and if you can, please reply quickly because I have to submit the homework within 5 hours.

I just tried it and it works.
Here’s a sample execution:

    Enter a string (‘.’ to exit): 1
    String accepted.
    Enter a string (‘.’ to exit): 1 2
    String accepted.
    Enter a string (‘.’ to exit): 2
    String rejected.
    Enter a string (‘.’ to exit): 1 2 1 2
    String accepted.
    Enter a string (‘.’ to exit): 1 2 1
    String accepted.
    Enter a string (‘.’ to exit): 1 2 2
    String rejected.
    Enter a string (‘.’ to exit): .

Make sure to give spaces between integers. Input symbols are integers from 1 through M. Don’t give anything else.

Epsilon closure of a state (or a set of states) is the set of states which can be reached by using epsilon moves, starting from that (or those) state(s). The code has two overloads to calculate epsilon closure, one for a single state and one for a set of states.
The latter basically calls the former for each of the component states. We pass in a bitset (you can think of it as a boolean array), where a set bit indicates that the state corresponding to its index falls in the epsilon closure, and the function is supposed to fill in that bitset. For a given state, we check which states are reachable from it using epsilon moves, set their corresponding bit and recursively call the function for those states.
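A minimal sketch of the two overloads described above might look like this. It is a reconstruction from the description, not the post's code: the names `kMaxStates`, `StateSet` and `epsMoves`, and the adjacency-list layout of the epsilon moves, are assumptions. The state itself is included in its own closure, and a state whose bit is already set is not visited again, which keeps the recursion from looping on epsilon cycles.

```cpp
#include <bitset>
#include <cassert>
#include <vector>

constexpr int kMaxStates = 64;  // assumed upper bound on the number of NFA states
using StateSet = std::bitset<kMaxStates>;

// epsMoves[s] lists the states reachable from s by a single epsilon move
// (hypothetical layout chosen for this sketch).
// Single-state overload: fills `closure` with the epsilon closure of `state`.
void epsilonClosure(int state, const std::vector<std::vector<int>>& epsMoves,
                    StateSet& closure) {
    assert(state >= 0 && state < static_cast<int>(epsMoves.size()));
    closure.set(state);  // a state is always in its own closure
    for (int next : epsMoves[state])
        if (!closure.test(next))  // skip states that were already visited
            epsilonClosure(next, epsMoves, closure);
}

// Set overload: calls the single-state version for each member of `states`.
void epsilonClosure(const StateSet& states,
                    const std::vector<std::vector<int>>& epsMoves,
                    StateSet& closure) {
    for (int s = 0; s < static_cast<int>(epsMoves.size()); ++s)
        if (states.test(s)) epsilonClosure(s, epsMoves, closure);
}
```

The caller passes in an empty (or partially filled) bitset and reads the result back from it, as in the description above.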
One fundamental aspect of the lexer vs parser issue is that lexers are based on finite automata (FSA), or more precisely finite transducers (FST). Most parsing formalisms (not just Context-Free) are closed under intersection with FSA or application of FST. Hence using the simpler regular expression based formalism for the lexer does not increase the complexity of syntactic structures of the more complex parser formalisms. This is an absolutely major modularity issue when defining structure and semantics of languages, happily ignored by the high voted answers. – Dec 27 '14 at 11:39

What parsers and lexers have in common:

• They read symbols of some alphabet from their input. Hint: The alphabet doesn't necessarily have to be of letters. But it has to be of symbols which are atomic for the language understood by the parser/lexer.
    • Symbols for the lexer: ASCII characters.
    • Symbols for the parser: the particular tokens, which are terminal symbols of their grammar.
• They analyse these symbols and try to match them with the grammar of the language they understand. And here's where the real difference usually lies. See below for more.
    • Grammar understood by lexers: regular grammar (Chomsky's level 3).
    • Grammar understood by parsers: context-free grammar (Chomsky's level 2).
• They attach semantics (meaning) to the language pieces they find.
    • Lexers attach meaning by classifying lexemes (strings of symbols from the input) as the particular tokens. All these lexemes: *, ==.

They both take a series of symbols from the alphabet they recognize.
For the lexer, this alphabet consists just of plain characters. For the parser, the alphabet consists of terminal symbols, however they are defined. They could be characters, too, if you don't use a lexer and use one-character identifiers and one-digit numbers etc. (quite useful at the first stages of development). But they're usually tokens (lexical classes) because tokens are a good abstraction: you can change the actual lexemes (strings) they stand for, and the parser doesn't see the change. – Aug 2 '12 at 1:02

For example, you can use a terminal symbol STMT_END in your syntax (for the parser) to denote the end of instructions. Now you can have a token with the same name associated with it, generated by the lexer. But you can change the actual lexeme it stands for. You can define STMT_END as ; to have C/C++-like source code. Or you can define it as end to make it somewhat similar to Pascal style. Or you can define it as just '\n' to end the instruction with the end of line, like in Python. But the syntax of the instruction (and the parser) stays unchanged :-) Only the lexer needs to be changed. – Aug 2 '12 at 1:08
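To make the STMT_END point concrete, here is a toy lexer sketch in which the lexeme for the statement terminator is just a parameter. Everything in it (the `Token` enum, the `lex` function, treating every other alphanumeric word as an identifier and skipping all other characters) is an assumption made for illustration.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Token classes seen by the parser; the parser never sees the lexemes themselves.
enum class Token { Ident, StmtEnd };

// Toy lexer: `stmtEnd` is the lexeme that stands for STMT_END, e.g. ";",
// "end" or "\n". Alphanumeric words are identifiers unless they equal
// `stmtEnd`; any other character that is not the terminator is skipped.
std::vector<Token> lex(const std::string& src, const std::string& stmtEnd) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        if (std::isalnum(static_cast<unsigned char>(src[i]))) {
            std::size_t begin = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            std::string word = src.substr(begin, i - begin);
            out.push_back(word == stmtEnd ? Token::StmtEnd : Token::Ident);
        } else if (!stmtEnd.empty() && src.compare(i, stmtEnd.size(), stmtEnd) == 0) {
            out.push_back(Token::StmtEnd);
            i += stmtEnd.size();
        } else {
            ++i;  // whitespace or anything else this toy lexer ignores
        }
    }
    return out;
}
```

Lexing "a;b;" with ";", "a end b end" with "end", and "a\nb\n" with "\n" yields the same token sequence (Ident, StmtEnd, Ident, StmtEnd), so a parser written against those tokens would not need to change.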
Yes, they are very different in theory, and in implementation. Lexers are used to recognize 'words' that make up language elements, because the structure of such words is generally simple. Regular expressions are extremely good at handling this simpler structure, and there are very high-performance regular-expression matching engines used to implement lexers.
Parsers are used to recognize the 'structure' of a language's phrases. Such structure is generally far beyond what 'regular expressions' can recognize, so one needs 'context sensitive' parsers to extract such structure. Context-sensitive parsers are hard to build, so the engineering compromise is to use 'context-free' grammars and add hacks to the parsers ('symbol tables', etc.) to handle the context-sensitive part. Neither lexing nor parsing technology is likely to go away soon. They may be unified by deciding to use 'parsing' technology to recognize 'words', as is currently explored by so-called scannerless GLR parsers. That has a runtime cost, as you are applying more general machinery to what is often a problem that doesn't need it, and usually you pay for that in overhead. Where you have lots of free cycles, that overhead may not matter. If you process a lot of text, then the overhead does matter and classical regular expression parsers will continue to be used.

The theory is different, because it has been proposed by many different people and uses different terminology and algorithms. But if you look at them closely, you can spot the similarities. For example, the problem of left recursion is very similar to the problem of non-determinism in NFAs, and removing left recursion is similar to removing non-determinism and converting an NFA into a DFA. Tokens are sentences for the tokenizer (output), but alphabetical symbols for the parser (input). I don't deny the differences (Chomsky levels), but similarities help a lot in design. – Feb 19 '12 at 18:24
My officemate was into category theory. He showed how the categorical theory notion of sheaves covered all kinds of pattern matching, and was able to derive LR parsing from an abstract categorical specification. So in fact, if you go abstract enough, you can find such commonalities. The point of category theory is you can often abstract 'all the way up'; I'm sure you could build a category theory parser that erased the differences.
But any practical uses of it have to instantiate down to the specific problem domain, and then the differences show up as real. – Oct 30 '13 at 0:55

When is lexing enough, when do you need EBNF?

EBNF really doesn't add much to the power of grammars. It's just a convenience / shortcut notation / 'syntactic sugar' over the standard Chomsky's Normal Form (CNF) grammar rules. For example, the EBNF alternative:

    S --> A | B

you can achieve in CNF by just listing each alternative production separately:

    S --> A    // `S` can be `A`,
    S --> B    // or it can be `B`.

The optional element from EBNF:

    S --> X?

you can achieve in CNF by using a nullable production, that is, one which can be replaced by an empty string (denoted by just an empty production here; others use epsilon or lambda or a crossed circle):

    S --> B    // `S` can be `B`,
    B --> X    // and `B` can be just `X`,
    B -->      // or it can be empty.

A production in a form like the last one for B above is called 'erasure', because it can erase whatever it stands for in other productions (produce an empty string instead of something else).

Zero-or-more repetition from EBNF:

    S --> A*

you can obtain by using a recursive production, that is, one which embeds itself somewhere in it. It can be done in two ways. The first one is left recursion (which usually should be avoided, because top-down recursive descent parsers cannot parse it):

    S --> S A    // `S` is just itself ended with `A` (which can be done many times),
    S -->        // or it can begin with the empty string, which stops the recursion.

Knowing that it generates just an empty string (ultimately) followed by zero or more As, the same language (but not the same parse tree!) can be expressed using right recursion:

    S --> A S    // `S` can be `A` followed by itself (which can be done many times),
    S -->        // or it can be just the empty-string end, which stops the recursion.

And when it comes to + for one-or-more repetition from EBNF:

    S --> A+

it can be done by factoring out one A and using * as before:

    S --> A A*

which you can express in CNF as such (I use right recursion here; try to figure out the other one yourself as an exercise):

    S --> A S    // `S` can be one `A` followed by `S` (which stands for more `A`s),
    S --> A      // or it could be just one single `A`.

Knowing that, you can now probably recognize a grammar for a regular expression (that is, a regular grammar) as one which can be expressed in a single EBNF production consisting only of terminal symbols. More generally, you can recognize regular grammars when you see productions similar to these:

    A -->        // Empty (nullable) production (AKA erasure).
    B --> x      // Single terminal symbol.
    C --> y D    // Simple state change from `C` to `D` when seeing input `y`.
    E --> F z    // Simple state change from `E` to `F` when seeing input `z`.
    G --> G u    // Left recursion.
    H --> v H    // Right recursion.
That is, using only empty strings, terminal symbols, simple non-terminals for substitutions and state changes, and using recursion only to achieve repetition (iteration, which is just linear recursion, the kind which doesn't branch tree-like). If there is nothing more advanced than these, then you're sure it's a regular syntax and you can go with just a lexer for that.

But when your syntax uses recursion in a non-trivial way, to produce tree-like, self-similar, nested structures, like the following one:

    S --> a S b    // `S` can be itself 'parenthesized' by `a` and `b` on both sides,
    S -->          // or it could be (ultimately) empty, which ends recursion.

then you can easily see that this cannot be done with a regular expression, because you cannot resolve it into one single EBNF production in any way; you'll end up substituting for S indefinitely, which will always add more as and bs on both sides. Lexers (more specifically: the Finite State Automata used by lexers) cannot count to an arbitrary number (they are finite, remember?), so they don't know how many as there were to match them evenly with as many bs. Grammars like this are called context-free grammars (at the very least), and they require a parser.
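For contrast, a parser handles this grammar easily because the call stack does the counting. The following is a minimal recursive descent sketch for the two productions above (S --> a S b and the empty production); the function names `parseS` and `matchesNested` are invented for this example.

```cpp
#include <cstddef>
#include <string>

// Recursive descent for:  S --> a S b  |  (empty)
// Tries to parse one S starting at `pos` and returns the position just after
// it, or std::string::npos if an opened `a` has no matching `b`.
std::size_t parseS(const std::string& in, std::size_t pos) {
    if (pos < in.size() && in[pos] == 'a') {
        std::size_t afterInner = parseS(in, pos + 1);  // the nested S
        if (afterInner != std::string::npos && afterInner < in.size() &&
            in[afterInner] == 'b')
            return afterInner + 1;                     // consumed the matching b
        return std::string::npos;
    }
    return pos;  // empty production: consume nothing
}

// The whole input belongs to the language only if S consumes all of it.
bool matchesNested(const std::string& in) {
    return parseS(in, 0) == in.size();
}
```

Each pending call remembers one unmatched a, which is exactly the unbounded memory a finite automaton lacks.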
Context-free grammars are well understood and practical to parse, so they are widely used for describing programming languages' syntax. But there's more. Sometimes a more general grammar is needed, when you have more things to count at the same time, independently. For example, when you want to describe a language where one can use round parentheses and square braces interleaved, but they have to be paired up correctly with each other (braces with braces, round with round). This kind of grammar is called context-sensitive. You can recognize it by the fact that it has more than one symbol on the left (before the arrow). For example:

    A R B --> A S B

You can think of these additional symbols on the left as a 'context' for applying the rule. There could be some preconditions, postconditions etc. For example, the above rule will substitute R into S, but only when it's in between A and B, leaving those A and B themselves unchanged. This kind of syntax is really hard to parse, because it needs a full-blown Turing machine. That's a whole other story, so I'll end here.

To answer the question as asked (without repeating unduly what appears in other answers): Lexers and parsers are not very different, as suggested by the accepted answer. Both are based on simple language formalisms: regular languages for lexers and, almost always, context-free (CF) languages for parsers.
They are both associated with fairly simple computational models, the finite state automaton and the push-down stack automaton. Regular languages are a special case of context-free languages, so that lexers could be produced with the somewhat more complex CF technology. But it is not a good idea for at least two reasons.

A fundamental point in programming is that a system component should be built with the most appropriate technology, so that it is easy to produce, to understand and to maintain. The technology should not be overkill (using techniques much more complex and costly than needed), nor should it be at the limit of its power, thus requiring technical contortions to achieve the desired goal. That is why 'It seems fashionable to hate regular expressions'. Though they can do a lot, they sometimes require very unreadable coding to achieve it, not to mention the fact that various extensions and restrictions in implementation somewhat reduce their theoretical simplicity. Lexers do not usually do that, and are usually a simple, efficient, and appropriate technology to parse tokens. Using CF parsers for tokens would be overkill, though it is possible.

Another reason not to use the CF formalism for lexers is that it might then be tempting to use the full CF power. But that might raise structural problems regarding the reading of programs. Fundamentally, most of the structure of program text, from which meaning is extracted, is a tree structure. It expresses how the parsed sentence (program) is generated from syntax rules. Semantics is derived by compositional techniques (homomorphism for the mathematically oriented) from the way syntax rules are composed to build the parse tree. Hence the tree structure is essential. The fact that tokens are identified with a regular set based lexer does not change the situation, because CF composed with regular still gives CF (I am speaking very loosely about regular transducers, which transform a stream of characters into a stream of tokens). However, CF composed with CF (via CF transducers; sorry for the math) does not necessarily give CF, and might make things more general, but less tractable in practice. So CF is not the appropriate tool for lexers, even though it can be used.

One of the major differences between regular and CF is that regular languages (and transducers) compose very well with almost any formalism in various ways, while CF languages (and transducers) do not, not even with themselves (with a few exceptions). (Note that regular transducers may have other uses, such as formalization of some syntax error handling techniques.)

BNF is just a specific syntax for presenting CF grammars. EBNF is syntactic sugar for BNF, using the facilities of regular notation to give terser versions of BNF grammars. It can always be transformed into an equivalent pure BNF. However, the regular notation is often used in EBNF only to emphasize those parts of the syntax that correspond to the structure of lexical elements, and should be recognized with the lexer, while the rest will rather be presented in straight BNF. But it is not an absolute rule.

To summarize, the simpler structure of tokens is better analyzed with the simpler technology of regular languages, while the tree oriented structure of the language (of program syntax) is better handled by CF grammars.

But this leaves a question open: Why trees? Trees are a good basis for specifying syntax because

• they give a simple structure to the text
• they are very convenient for associating semantics with the text on the basis of that structure, with a mathematically well understood technology (compositionality via homomorphisms), as indicated above.
It is a fundamental algebraic tool to define the semantics of mathematical formalisms. Hence it is a good intermediate representation, as shown by the success of Abstract Syntax Trees (AST).
Note that ASTs are often different from parse trees because the parsing technology used by many professionals (such as LL or LR) applies only to a subset of CF grammars, thus forcing grammatical distortions which are later corrected in the AST. This can be avoided with more general parsing technology (based on dynamic programming) that accepts any CF grammar.

Statements to the effect that programming languages are context-sensitive (CS) rather than CF are arbitrary and disputable. The problem is that the separation of syntax and semantics is arbitrary. Checking declarations or type agreement may be seen as either part of syntax, or part of semantics. The same would be true of gender and number agreement in natural languages. But there are natural languages where plural agreement depends on the actual semantic meaning of words, so that it does not fit well with syntax. Many definitions of programming languages in denotational semantics place declarations and type checking in the semantics. So stating, as another answer does, that CF parsers are being hacked to get a context sensitivity required by syntax is at best an arbitrary view of the situation. It may be organized as a hack in some compilers, but it does not have to be. Also, it is not just that CS parsers (in the sense used in other answers here) are hard to build, and less efficient.
They are also inadequate to express perspicuously the kind of context-sensitivity that might be needed. And they do not naturally produce a syntactic structure (such as parse trees) that is convenient for deriving the semantics of the program, i.e. for generating the compiled code.

There are a number of reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases.

• Simplicity of design is the most important consideration. The separation of lexical and syntactic analysis often allows us to simplify at least one of these tasks. For example, a parser that had to deal with comments and white space as syntactic units would be considerably more complex than one that can assume comments and white space have already been removed by the lexical analyzer. If we are designing a new language, separating lexical and syntactic concerns can lead to a cleaner overall language design.
• Compiler efficiency is improved. A separate lexical analyzer allows us to apply specialized techniques that serve only the lexical task, not the job of parsing. In addition, specialized buffering techniques for reading input characters can speed up the compiler significantly.
• Compiler portability is enhanced. Input-device-specific peculiarities can be restricted to the lexical analyzer.

Source: Compilers: Principles, Techniques, and Tools (2nd Edition), by Alfred V. Aho (Columbia University), Monica S. Lam (Stanford University), Ravi Sethi (Avaya) and Jeffrey D. Ullman (Stanford University).
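The simplicity point in the list above (a parser that can assume comments and white space are already gone) can be illustrated with a small sketch. It is a made-up example, not taken from the book: the `tokenize` function, the choice of `//` line comments, and the rule that every non-alphanumeric character is its own token are all assumptions.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Toy lexical analyzer: drops white space and `//` line comments so that the
// parser only ever sees meaningful tokens. Words of letters and digits become
// one token each; any other character becomes a one-character token.
std::vector<std::string> tokenize(const std::string& src) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;                                         // white space is discarded
        } else if (src.compare(i, 2, "//") == 0) {
            while (i < src.size() && src[i] != '\n') ++i;  // skip to end of line
        } else if (std::isalnum(c)) {
            std::size_t begin = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            tokens.push_back(src.substr(begin, i - begin));
        } else {
            tokens.push_back(std::string(1, src[i]));
            ++i;
        }
    }
    return tokens;
}
```

For the input "x = 1 // set x" followed by a newline and "y = x", the parser would receive only the six tokens x, =, 1, y, =, x.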