Skip to content
Sign upLog in
← Back to Community

Make a Full Lexer in Python!

Profile icon
ELDER054

What is a Lexer?

A lexer is an analyzer that moves through your code looking at each character, and trying to create tokens out of them
This input

int a =5*5

Can be turned into

[(‘KeyWord’, ‘int’), (‘ID’, ‘a’), (‘assign’, ‘=‘), (‘num’, 5), (‘OP’, ‘*’), (‘num’, 5)]

by the Lexer you will learn how to create.

What if I Have Problems?

If you have trouble understanding something or you get errors, tell me and I’ll try my best to tell you what’s wrong.

Let’s Get Started!

First, open a new python repl with whatever name you choose.
Then create a function lex. This will be our function that basically does everything :).
Then, make a variable code set to input(). Make sure the code initialization is not in the function.
And call lex on code.

def lex(line): code = input() lex(code)

After this set a new variable, in lex, and name it count or lexeme_count or something, set it to 0.

def lex(line): lexeme_count = 0 code = input() lex(code)

The lexeme_count variable is going to keep track of the chars you have already scanned.
Once you have that code, add a while loop saying that if chars you have scanned is less than length of line, keep scanning.

def lex(line): lexeme_count = 0 while lexeme_count < len(line): lexeme_count += 1 code = input() lex(code)

We will then make it more powerful by knowing what each lexeme is.

def lex(line): lexeme_count = 0 while lexeme_count < len(line): lexeme = line[lexeme_count] lexeme_count += 1 code = input() lex(code)

Then we can tell what the type is by using an if-elif-else statement to check the type of lexeme.
Make sure to move the lexeme_count += 1 part into the else.

def lex(line): lexeme_count = 0 while lexeme_count < len(line): lexeme = line[lexeme_count] if lexeme.isdigit(): elif lexeme.isalpha(): else: lexeme_count += 1 code = input() lex(code)

Let’s fill in the blank conditional blocks.

def lex(line): lexeme_count = 0 while lexeme_count < len(line): lexeme = line[lexeme_count] if lexeme.isdigit(): typ, tok, consumed = lex_num(line[lexeme_count:]) lexeme_count += consumed elif lexeme.isalpha(): typ, tok, consumed = lex_str(line[lexeme_count:]) lexeme_count += consumed else: lexeme_count += 1 code = input() lex(code)

Whoa, Slow Down! What’s Going on?

What we’re doing here, is we are making three variables. One for the type of each token, one for the token itself, and one for the amount of lexemes ‘consumed’, ‘eaten’, or ‘scanned’
Then we are assigning those variables to a function call which takes the rest of the line, and gets the rest of the token. We do this both for digits and strings
After this, we change the lexeme_count by the amount of chars consumed so they keep up with each other.

Is this it?

This is certainly not the full lexical analyzer, so let’s add some identifier lexing!
Once we have finished with that, we can scan for literals, conditionals, operators, keywords, etc

Let’s Lex Some Identifiers!

Add another elif to the if-elif-else statement this will check if lexeme is equal to a a letter of the alphabet.

def lex(line): lexeme_count = 0 while lexeme_count < len(line): lexeme = line[lexeme_count] if lexeme.isdigit(): typ, tok, consumed = lex_num(line[lexeme_count:]) lexeme_count += consumed elif lexeme == ‘“‘ or lexeme == “‘“: typ, tok, consumed = lex_str(line[lexeme_count:]) lexeme_count += consumed elif lexeme.isalpha(): else: lexeme_count += 1 code = input() lex(code)

In this elif, we need to mirror what we did earlier, but with a call to a different function; lex_id().

def lex(line): lexeme_count = 0 while lexeme_count < len(line): lexeme = line[lexeme_count] if lexeme.isdigit(): typ, tok, consumed = lex_num(line[lexeme_count:]) lexeme_count += consumed elif lexeme == ‘“‘ or lexeme == “‘“: typ, tok, consumed = lex_str(line[lexeme_count:]) lexeme_count += consumed elif lexeme.isalpha(): typ, tok, consumed = lex_id(line[lexeme_count]) lexeme_count += consumed else: lexeme_count += 1 code = input() lex(code)

Time To Make The functions!

We used three functions, but we haven’t defined them. Let’s go ahead and do that.

def lex_num(line): num= “” def lex_str(line): delimiter = line[0] string = “” def lex_id(line): id = “”

First we’ll make the lex_num function go till the end of the line and return the number.

def lex_num(line): num= “” for c in line: if not c.isdigit(): break return ‘num’, int(num), len(num) def lex_str(line): delimiter = line[0] string = “” def lex_id(line): id = “”

We will then fill out the lex_str() function doing the same thing as the digit one but for a string instead.

def lex_num(line): num= “” for c in line: if not c.isdigit(): break return ‘num’, int(num), len(num) def lex_str(line): delimiter = line[0] string = “” for c in line: string += c return ‘str’, string, len(string) def lex_id(line): id = “”

And now we will fill out the lex_id() function!

def lex_num(line): num= “” for c in line: if not c.isdigit(): break return ‘num’, int(num), len(num) def lex_str(line): delimiter = line[0] string = “” for c in line: if c==delimiter: break string += c return ‘str’, string, len(string) def lex_id(line): id = “” for c in line if not c.isdigit() and not c.isalpha and c != “_”: break id += c return ‘ID’, id, len(id)

What About KeyWords?

Yes, we will need to change the lex_id() function to know about key words...
What are you waiting for, read on!

We are going to make a list of keywords and check the id.

def lex_num(line): num= “” for c in line: if not c.isdigit(): break return ‘num’, int(num), len(num) def lex_str(line): delimiter = line[0] string = “” for c in line: if c==delimiter: break string += c return ‘str’, string, len(string) def lex_id(line): keys = [‘print’, ‘var’, ‘while’, ‘if’, ‘elif’, ‘else’] id = “” for c in line if not c.isdigit() and not c.isalpha and c != “_”: break id += c if id in keys: return ‘key’, id, len(id) else: return ‘ID’, id, len(id)

The Entire Code

I know you want to go out and try this, but if you need it, here is the full working code.
BTW if you copy and paste this code, it will result in an error because I use curly quotes and those are not used in programming ‘“‘“‘“‘“‘“‘, I guess you either have to manually take them out and replace them XD, or just look at this code as a reference.
If you want to copy and paste :(, do it below on my better lexer

def lex_num(line): num= “” for c in line: if not c.isdigit(): break return ‘num’, int(num), len(num) def lex_str(line): delimiter = line[0] string = “” for c in line: if c==delimiter: break string += c return ‘str’, string, len(string) def lex_id(line): keys = [‘print’, ‘var’, ‘while’, ‘if’, ‘elif’, ‘else’] id = “” for c in line if not c.isdigit() and not c.isalpha and c != “_”: break id += c if id in keys: return ‘key’, id, len(id) else: return ‘ID’, id, len(id) def lex(line): lexeme_count = 0 while lexeme_count < len(line): lexeme = line[lexeme_count] if lexeme.isdigit(): typ, tok, consumed = lex_num(line[lexeme_count:]) lexeme_count += consumed elif lexeme == ‘“‘ or lexeme == “‘“: typ, tok, consumed = lex_str(line[lexeme_count:]) lexeme_count += consumed elif lexeme.isalpha(): typ, tok, consumed = lex_id(line[lexeme_count]) lexeme_count += consumed else: lexeme_count += 1 code = input() lex(code)
Voters
Profile icon
KirillMcQuilli1
Profile icon
ThIsIsAnUsEr
Profile icon
RizkyRamadhan
Profile icon
B0T
Profile icon
programmeruser
Profile icon
CoolCoderSJ
Profile icon
elipie
Profile icon
ELDER054
Comments
hotnewtop
Profile icon
Wumi4

Nice! But I think that you should start with making tokens part first, for example, building a Token class that contains informations about the token(like column, lexeme, line and type), then start with the lexer. It will make everything more convenient since you just need to create a Token instance, eat informations for it, and return it to the user. It follows the DRY(Don't Repeat Yourself) rule very well! But anyway, nice one!

Profile icon
ELDER054

I’m still working on it,

@Wumi4

Profile icon
Wumi4

@Elderosa
I know. Its just a suggestion.

Profile icon
ELDER054

I think it’ll be better, especially if I am adding a parser to my project. I’ll need to make a class for that also,

@Wumi4

Profile icon
NigamDahal

Yeah, I agree with

@Wumi4
. This is more of a tokenizer rather than a lexer.

Profile icon
NigamDahal

@Elderosa
I'd love to see a parser. AST generation is not that simple, but good luck!

Profile icon
ELDER054

This is my first try,

@NigamDahal

Profile icon
programmeruser

@Elderosa
parsing isn't that hard if you use recursive descent.

Profile icon
ELDER054
Profile icon
ELDER054

btw, I’ve realized that handwritten compilers/interpreters are better :),

@programmeruser

Profile icon
adl212

could you make a parser so that i don't need to code my own programming language? https://repl.it/@adl212/programmingLangScratch don't actually do it lol

Profile icon
ELDER054

I will make a parser, but for myself. I might also make a tutorial on it like this,

@adl212

Profile icon
adl212

@Elderosa
Cool!

Profile icon
ELDER054

I’d like to see what this looks like in C++ or C. I kinda want it to be in a lower level language,

@adl212

Profile icon
adl212

@Elderosa
I think there are more possibilities of your coding language by using c++ or c. Like for example, in python you can only set variables one way, but in other languages you can do const <variable> = <value> which makes it read-only.

Profile icon
ELDER054

You can try and convert it to C++,

@adl212

Profile icon
ELDER054

I’d like to see if it is faster or better, including your const example.

@adl212

Profile icon
adl212

@Elderosa
I really don't know because I haven't coded c before lol

Profile icon
ELDER054

What about c++? M mind goes blank when it comes to C,

@adl212

Profile icon
adl212

@Elderosa
nope i only do python lol

Profile icon
ELDER054
Profile icon
[deleted]

best tutorial i found on the subject

Profile icon
ELDER054

Really!. Thank you so much,

@LucasAllori

Profile icon
CoolCoderSJ

Btw also when appending to tokens use a list, not a tuple. It's easier to manage and call. So basically make tokens a 2D List

Profile icon
ELDER054
Profile icon
ELDER054
Profile icon
ELDER054

Can I make a list of lists?

@CoolCoderSJ

Profile icon
CoolCoderSJ

@Elderosa
yes u can, i forked it and did it, and it worked. its called a 2d list.

I explain 2D lists thoroughly here

Once you make your account, go to option 4, the manual level picker, and choose level 8.

Profile icon
CoolCoderSJ
Profile icon
ELDER054
Profile icon
elipie

noice

Profile icon
ELDER054

I’m not done yet,

@elipie

Profile icon
ELDER054

With the tutorial at least. The Lexer is pretty good

@Elderosa

Profile icon
elipie

@Elderosa
shouldnt have posted it, then. lol, anything with PL dev, is an insta upvote from me

Profile icon
ELDER054

I’m trying to figure out how I should make my parse tree?

@elipie

Profile icon
elipie

@Elderosa
uhm, well, the syntax tree in a program isnt actually a tree. It literally is just doing stuff with the tokens. So dont try to make a tree, cuz you will get lost in trying to figure out where everything goes.

You know what a parse tree i am guessing, so its just the coding. THats the hard part lol.

this probably didnt help you at all
Profile icon
ELDER054

ik lol. So I should just like try to run the tokens then and there?

@elipie

Profile icon
ELDER054

This is not a finished post