How to make a Programming Language in Python from Scratch: A Basic Language
aguy11 (127)

How to make a Programming Language in Python

Hello everyone! In this tutorial, I will teach you how to make a very basic programming language, built around variables and output, from scratch. It uses no libraries apart from Python's built-in re module.

A Bit of Explanation

A programming language implementation usually has at least two parts: the lexer and the parser. Most of you have probably heard the word parser before, but if you're completely new to creating programming languages, you probably haven't heard the word lexer. To give you the shortest explanation possible: the lexer splits the code into tokens (which we'll talk about later), and the parser takes those tokens and, in this tutorial, converts them into the target language (in our case, Python).

So, how does a Lexer Work Anyway?

Okay, to show you a lexer's magic, let's take this line of code from a language I thought up:

createVar name = "Jerry";
This would create a variable called name and set its value to "Jerry".

And what would a lexer do to this? Well, something like this:

[["IDENTIFIER", "createVar"], ["IDENTIFIER", "name"], ["OPERATOR", "="], ["STRING", '"Jerry"'], ["STATEMENT_END", ";"]]

(Identifiers are unquoted words that start with a letter, such as keywords and variable names.)

As you can see, the lexer converts code, or text, into tokens. Each token is a two-item list with the following layout:

The item at Index 0: Type of token
The item at Index 1: The value of the token
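
As a small illustration (not part of the lexer itself), a token in this format is just a two-item list, so it can be indexed or unpacked into its type and value. The variable names below are only there for the example.

```python
# A token is a two-item list: [type, value]
token = ["IDENTIFIER", "createVar"]

token_type = token[0]   # Index 0: the type of token
token_value = token[1]  # Index 1: the value of the token

# The same idea for a whole list of tokens, using unpacking in the loop
tokens = [["IDENTIFIER", "name"], ["OPERATOR", "="], ["STRING", '"Jerry"']]
for kind, value in tokens:
    print(kind, value)
```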


You can create more tokens for your language to keep your project more organized, but those are the main ones.

But how would the lexer do this? Let's find out. Here is a small lexer that can handle a few types of tokens:

import re  # Used to check whether a chip is an identifier or an integer

OPERATORS = ["==", "/", "+=", "-=", "-", "+", "*", "<", ">", "<=", ">=", "!=", "or", "and", "in", "not", "="]

class Lexer(object):
  def __init__(self, code):
    self.source = code  # The source code to tokenize

  def add_token(self, tokens, token_type, chip):
    # Appends a token, splitting a trailing ";" off into its own STATEMENT_END token
    if chip[-1] == ";":
      if len(chip) > 1:
        tokens.append([token_type, chip[:-1]])
      tokens.append(["STATEMENT_END", ";"])
    else:
      tokens.append([token_type, chip])

  def string_closed(self, chip, quote):
    # True once chip ends with its opening quote, optionally followed by ";"
    body = chip[:-1] if chip[-1] == ";" else chip
    return len(body) > 1 and body[-1] == quote

  def tokenize(self):
    source_index = 0
    tokens = []
    source_code = self.source.split()  # Splits on whitespace, so tokens must be separated by spaces
    while source_index < len(source_code):
      chip = source_code[source_index]  # The current piece of the split code
      if chip in OPERATORS:
        tokens.append(["OPERATOR", chip])
      elif chip == ";":
        tokens.append(["STATEMENT_END", ";"])
      elif chip[0] in ['"', "'"]:
        quote = chip[0]
        # A one-word string is already closed; a multi-word string is joined chip by chip until its closing quote
        while not self.string_closed(chip, quote):
          source_index += 1
          if source_index >= len(source_code):
            raise SyntaxError("Unterminated string: " + chip)
          chip = chip + " " + source_code[source_index]
        self.add_token(tokens, "STRING", chip)
      elif re.match("[a-zA-Z]", chip):  # Identifier: starts with a letter
        self.add_token(tokens, "IDENTIFIER", chip)
      elif re.fullmatch("[0-9]+;?", chip):  # Integer: digits only, with an optional ";" at the end
        self.add_token(tokens, "INTEGER", chip)
      else:
        raise SyntaxError("Unknown token: " + chip)  # Anything else is reported instead of silently dropped
      source_index += 1
    return tokens
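
The lexer above splits the source on whitespace, so every token has to be separated by spaces. As an alternative sketch (a different technique from the one used in this tutorial), the token types can be described with regular expressions and scanned with re.finditer(). The TOKEN_SPEC table, the patterns in it, and the regex_tokenize name are all assumptions made for this example.

```python
import re

# (token type, regex) pairs, tried in order. This table is an assumption
# for illustration; extend it to fit your own language.
TOKEN_SPEC = [
    ("STRING", r'"[^"]*"|\'[^\']*\''),
    ("INTEGER", r"[0-9]+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR", r"==|!=|<=|>=|\+=|-=|[-+*/<>=]"),
    ("STATEMENT_END", r";"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
MASTER_PATTERN = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)
WORD_OPERATORS = {"or", "and", "in", "not"}


def regex_tokenize(code):
    """Return a list of [type, value] tokens for the given source code."""
    tokens = []
    for match in MASTER_PATTERN.finditer(code):
        token_type = match.lastgroup  # Name of the group that matched
        value = match.group()
        if token_type == "SKIP":
            continue  # Whitespace carries no meaning here
        if token_type == "MISMATCH":
            raise SyntaxError(f"Unexpected character: {value!r}")
        if token_type == "IDENTIFIER" and value in WORD_OPERATORS:
            token_type = "OPERATOR"  # Words like "or" are operators
        tokens.append([token_type, value])
    return tokens


# No spaces are needed around "=" with this approach
print(regex_tokenize('createVar name="Jerry";'))
```

The trade-off is that the rules live in one table instead of a chain of elif branches, which makes new token types easier to add but the patterns harder to read at first.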

Hopefully you now understand how the lexer works, so let's move on to the parser!

How does a parser work??

A parser takes the tokens and, in this tutorial, converts them into the target language (Python in our case). The result is then executed using exec(), eval(), subprocess.call(), etc. The parser also throws errors along the way when the tokens don't form a valid statement.
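
To make the execution step concrete, here is a minimal sketch using Python's built-in exec() and eval(). The generated_code string stands in for whatever the parser produced, and the namespace dictionary is an assumption of this sketch: passing it keeps the variables of the new language separate from the interpreter's own.

```python
# Python code that the parser is assumed to have produced
generated_code = 'name = "Jerry"\n'

# Run the generated code inside its own namespace dictionary
namespace = {}
exec(generated_code, namespace)

print(namespace["name"])  # Jerry

# eval() only handles single expressions, so it suits reading a value back
print(eval("name.upper()", namespace))  # JERRY
```

Since exec() runs whatever it is given, only use it on code you trust.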

So, here are our tokens from the lexer:

[["IDENTIFIER", "createVar"], ["IDENTIFIER", "name"], ["OPERATOR", "="], ["STRING", '"Jerry"'], ["STATEMENT_END", ";"]]

And a parser would convert this into the following:

name = "Jerry"

To do this, the parser would go through a process similar to the following:

Iterate through the list of tokens:
  If the token type is IDENTIFIER and the token value is "createVar":
    Send the unparsed tokens to the variable parser
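
The loop above could be sketched in Python roughly as follows. This is one possible shape, not the tutorial's actual parser: it first groups the tokens into statements, and the function names (split_statements, parse, parse_variable) are made up for this example. parse_variable is only a stub here that reports how many tokens it received.

```python
def split_statements(tokens):
    """Group tokens into statements, cutting after each STATEMENT_END."""
    statements, current = [], []
    for token in tokens:
        current.append(token)
        if token[0] == "STATEMENT_END":
            statements.append(current)
            current = []
    if current:  # Tokens left over without a closing ";"
        statements.append(current)
    return statements


def parse_variable(statement):
    # Stub: a real variable parser would validate and translate the tokens
    return "variable statement with " + str(len(statement)) + " tokens"


def parse(tokens):
    """Send each statement to the parser that matches its first token."""
    results = []
    for statement in split_statements(tokens):
        token_type, token_value = statement[0]
        if token_type == "IDENTIFIER" and token_value == "createVar":
            results.append(parse_variable(statement))
        else:
            raise SyntaxError("Unknown statement: " + token_value)
    return results


tokens = [["IDENTIFIER", "createVar"], ["IDENTIFIER", "name"], ["OPERATOR", "="],
          ["STRING", '"Jerry"'], ["STATEMENT_END", ";"]]
print(parse(tokens))
```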

This would be a very small parser that can only parse variables. The variable parser receives the tokens starting at createVar (token number 0), and its process would look like this:

Note: I wrote this in English-like pseudocode; for a Python version, look in the code.

For every token in the tokens sent:
  If the token type is STATEMENT_END:
    Break the loop
  If the token number is 1 and the token type is IDENTIFIER:
    The variable name is the token value
  If the token number is 1 and the token type is not IDENTIFIER:
    Throw an error saying "Invalid Variable Name"
  If the token number is 2 and the token type is OPERATOR:
    The variable operator is the token value
  If the token number is 2 and the token type is not OPERATOR:
    Throw an error saying "Invalid Variable Operator"
  If the token number is 3 and the token type is IDENTIFIER, STRING, or INTEGER:
    The variable value is the token value
  If the token number is 3 and the token type is not IDENTIFIER, STRING, or INTEGER:
    Throw an error saying "Invalid Variable Value"
  If the token number is bigger than 3 and the token type is IDENTIFIER, STRING, INTEGER, or OPERATOR:
    The variable value is the variable value + a space + the token value
  If the token number is bigger than 3 and the token type is not IDENTIFIER, STRING, INTEGER, or OPERATOR:
    Throw an error saying "Invalid Variable Value"

This should create three variables: variable name, variable operator, and the variable value.
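
For readers who prefer code to pseudocode, the steps above might look like this in Python. This is a sketch, not the tutorial's actual parser: the function name parse_variable and the use of SyntaxError are assumptions, the tokens are expected to start with the createVar token (token number 0), and extra value tokens are joined with a space.

```python
VALUE_TYPES = ("IDENTIFIER", "STRING", "INTEGER")


def parse_variable(tokens):
    """Return (name, operator, value) for a createVar statement."""
    name = operator = value = None
    for number, (token_type, token_value) in enumerate(tokens):
        if token_type == "STATEMENT_END":
            break
        if number == 0:
            continue  # Token 0 is the createVar keyword itself
        if number == 1:
            if token_type != "IDENTIFIER":
                raise SyntaxError("Invalid Variable Name")
            name = token_value
        elif number == 2:
            if token_type != "OPERATOR":
                raise SyntaxError("Invalid Variable Operator")
            operator = token_value
        elif number == 3:
            if token_type not in VALUE_TYPES:
                raise SyntaxError("Invalid Variable Value")
            value = token_value
        else:
            if token_type not in VALUE_TYPES + ("OPERATOR",):
                raise SyntaxError("Invalid Variable Value")
            # A space keeps word operators such as "or" from merging with operands
            value = value + " " + token_value
    return name, operator, value


tokens = [["IDENTIFIER", "createVar"], ["IDENTIFIER", "age"], ["OPERATOR", "="],
          ["INTEGER", "20"], ["OPERATOR", "+"], ["INTEGER", "1"], ["STATEMENT_END", ";"]]
print(parse_variable(tokens))  # ('age', '=', '20 + 1')
```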

Now we will send them through a small translator:

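class VarObject(object):
  def __init__(self):
    self.exec_code = ""  # The Python code generated so far
  def transpile(self, name, operator, value):
    # Adds one line of Python, such as: name = "Jerry"
    self.exec_code = self.exec_code + name + " " + operator + " " + value + "\n"
    return self.exec_code  # Returns all of the code generated so far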
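
As a rough usage sketch, the translator can be fed the three values and its output run with exec(). The class is repeated here only so the example runs on its own, and the sample values are made up.

```python
class VarObject(object):
    def __init__(self):
        self.exec_code = ""

    def transpile(self, name, operator, value):
        self.exec_code = self.exec_code + name + " " + operator + " " + value + "\n"
        return self.exec_code


translator = VarObject()
translator.transpile("name", "=", '"Jerry"')
# Each call returns everything generated so far, not just the newest line
python_code = translator.transpile("age", "=", "20 + 1")
print(python_code)
# name = "Jerry"
# age = 20 + 1

namespace = {}
exec(python_code, namespace)
print(namespace["name"], namespace["age"])  # Jerry 21
```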


You have now made a very basic (and hopefully working) computer language. You can add more features, like output and user input, simply by giving the parser more statement parsers!
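
As one hedged example of such an extension, here is how an output statement could be translated. The keyword output and the function parse_output are hypothetical; they are not part of the language described above, so pick whatever syntax you like.

```python
def parse_output(tokens):
    """Translate the tokens of `output <value>;` into a Python print() call."""
    parts = []
    for token_type, token_value in tokens[1:]:  # Skip the output keyword
        if token_type == "STATEMENT_END":
            break
        if token_type not in ("IDENTIFIER", "STRING", "INTEGER", "OPERATOR"):
            raise SyntaxError("Invalid Output Value")
        parts.append(token_value)
    if not parts:
        raise SyntaxError("Nothing To Output")
    return "print(" + " ".join(parts) + ")\n"


# Tokens for the hypothetical statement: output name;
tokens = [["IDENTIFIER", "output"], ["IDENTIFIER", "name"], ["STATEMENT_END", ";"]]
python_code = 'name = "Jerry"\n' + parse_output(tokens)
print(python_code)
exec(python_code, {})  # prints Jerry
```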

Thanks for reading this!

I hope you learned something!!

EDIT: I'll count this as my 100-cycle celebration

aguy11 (127)

@CodingRedpanda Yeah, I guess, but it's not like the code uses a ton. Just one.

angrydoge (465)

Do you know how to make if statements in this, or a repeat loop like @Lethdev2019 did?

aguy11 (127)

@dabombdgdzjr Yes, and I've done it in 2 langs now. I've also done functions, so I can help with anything up to that

angrydoge (465)

bo thats basically everything i need @aguy11. My coding language is BO V2 I edited this tutorial a bit, could u help me?

aguy11 (127)

@dabombdgdzjr Yes, of course. Should I comment you a tutorial for if statements and loops first? (Functions are harder, and well, let's say I can only do them at a very limited level. Like the pun though).

angrydoge (465)

Could you like, live help me on a repl? I suck at reading tutorials it all just goes thru and out my head @aguy11

angrydoge (465)

When do you think you can help me on it today @aguy11

angrydoge (465)

Can you help me on it today @aguy11?

angrydoge (465)

@aguy11 are you able to come on today?

aguy11 (127)

@dabombdgdzjr whoa, pings, pings, pings, probably not but I’ll ping you if I can

angrydoge (465)

Can u today its been three days :D @aguy11

aguy11 (127)

@dabombdgdzjr I can right now, but until 6 PM CST.

angrydoge (465)

kk! On my repl or urs cuz we were gonna add the manually right cuz i invited u @aguy11

StringentDev (207)

@aguy11 got a repeat loop into my programming language using this. YAY. But it is not able to use nested loops.

aguy11 (127)

@Lethdev2019 Can you give me the syntax? I might be able to help.

StringentDev (207)

did not exactly work, but we're getting somewhere, the lines have uneven spacing, I have changed it to spaces to visualise the issue @aguy11

StringentDev (207)

all fixed, thanks for the help, I analysed the compiled code and determined it was from the start of repeat, I was correct, there was an extra space. @aguy11

aguy11 (127)

@Lethdev2019 You're welcome! Sorry, I wasn't online to see your other comments, but I'm glad that the issue's been resolved.

StringentDev (207)

@aguy11 and now, i am trying to find out how you created comments in your language.

StringentDev (207)

Seriously, i am using a command rn. @aguy11

aguy11 (127)

@Lethdev2019 Oh, so basically the lexer does that. If it detects a symbol, like ^ or something to begin a comment, it would then iterate through the next words to detect if the last letter is the same symbol. If so, it stops the iteration and adds a token containing ["COMMENT", "CONTENTS OF COMMENT"]. Then if the parser sees a COMMENT token, it would just skip it, not even add it to the code. Hope this helped!
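
A minimal sketch of that idea, assuming ^ as the comment symbol and whitespace-split chips as in the tutorial's lexer. The function name and the WORD placeholder type are hypothetical:

```python
def tokenize_comments(code, symbol="^"):
    """Turn ^ ... ^ spans into COMMENT tokens; other chips become WORD tokens."""
    chips = code.split()
    tokens = []
    index = 0
    while index < len(chips):
        chip = chips[index]
        if chip[0] == symbol:
            # Keep joining chips until one ends with the comment symbol
            while not (len(chip) > 1 and chip[-1] == symbol):
                index += 1
                if index >= len(chips):
                    raise SyntaxError("Unterminated comment")
                chip = chip + " " + chips[index]
            tokens.append(["COMMENT", chip[1:-1].strip()])
        else:
            tokens.append(["WORD", chip])  # Placeholder type for this sketch
        index += 1
    return tokens


print(tokenize_comments("^ set the name ^ createVar name"))
```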

StringentDev (207)

but how do you make it ignore it, i have set up the symbol as "?". @aguy11

aguy11 (127)

@Lethdev2019 Just go self.token_index += 1, and it's complete.
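
In other words, the parser loop just steps past the token. A small sketch, assuming a parser class that tracks its position in self.token_index as in the reply above (the class itself and its other names are made up here):

```python
class Parser(object):
    def __init__(self, tokens):
        self.tokens = tokens
        self.token_index = 0
        self.kept = []  # Tokens that will actually be parsed

    def parse(self):
        while self.token_index < len(self.tokens):
            token_type, token_value = self.tokens[self.token_index]
            if token_type == "COMMENT":
                self.token_index += 1  # Skip the comment entirely
                continue
            self.kept.append([token_type, token_value])
            self.token_index += 1
        return self.kept


tokens = [["COMMENT", "set the name"], ["IDENTIFIER", "createVar"], ["IDENTIFIER", "name"]]
print(Parser(tokens).parse())
```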

StringentDev (207)

oh, it does not fire? i am not sure what happened there, so i am running a debug. @aguy11

StringentDev (207)

are you online, i cannot see your cursor at all. @aguy11