How to count tokens in Python

Learn to count tokens in Python. Explore different methods, tips and tricks, real-world applications, and how to debug common errors.

Published on: Wed, Mar 25, 2026
Updated on: Thu, Mar 26, 2026
The Replit Team

The ability to count tokens in Python is a crucial skill for working with large language models. It helps manage API costs, optimize performance, and ensure prompts fit model context limits.

Here, you'll explore several techniques for accurate token counting. You will find practical tips, see real-world applications, and get debugging advice to help you master tokenization in your projects.

Basic token counting with split()

text = "This is a simple example of counting tokens in Python."
tokens = text.split()
token_count = len(tokens)
print(f"Token count: {token_count}")
print(f"Tokens: {tokens}")--OUTPUT--Token count: 10
Tokens: ['This', 'is', 'a', 'simple', 'example', 'of', 'counting', 'tokens', 'in', 'Python.']

The simplest approach to token counting in Python uses the built-in split() method. It breaks a string into a list of substrings wherever it finds whitespace. You can then use len() to get the total count, which gives you a quick word count that approximates the token count.

While this method is fast and requires no external libraries, it isn't precise enough for most LLM applications. Notice how split() keeps punctuation attached to words, like in 'Python.'. True tokenizers used by large language models handle punctuation, capitalization, and subwords much more effectively.

Common tokenization techniques

To achieve that level of precision, you can move beyond split() and use more powerful tools for pattern matching, frequency analysis, and linguistic tokenization.

Using re.findall() for pattern-based tokenization

import re
text = "Count tokens, including punctuation! It's important."
tokens = re.findall(r'\b\w+(?:\'t|\'s)?\b|[^\w\s]', text)
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")--OUTPUT--Token count: 9
Tokens: ['Count', 'tokens', ',', 'including', 'punctuation', '!', "It's", 'important', '.']

For more precise control, you can use Python's regular expression module. The re.findall() function finds all substrings that match a specific pattern, giving you a more granular way to define tokens. This approach is a significant step up from a simple split().

The pattern r'\b\w+(?:\'t|\'s)?\b|[^\w\s]' does two key things:

  • It identifies whole words while correctly handling common contractions like It's.
  • It captures any character that isn't a letter, number, or space—effectively isolating punctuation.

This method correctly separates words from punctuation, a crucial step for accurate tokenization in LLM applications.

Counting token frequencies with collections.Counter

from collections import Counter
text = "token token another token and another"
token_counts = Counter(text.split())
print(f"Unique tokens: {len(token_counts)}")
print(f"Token frequencies: {dict(token_counts)}")--OUTPUT--Unique tokens: 4
Token frequencies: {'token': 3, 'another': 2, 'and': 1}

Beyond just counting total tokens, you often need to know how frequently each token appears. Python's collections.Counter is perfect for this. It takes an iterable, such as the list from text.split(), and returns a dictionary-like object that tallies the occurrences of each item.

  • Counter simplifies frequency analysis, giving you a quick overview of your text's vocabulary.
  • The resulting object shows both the unique tokens and their respective counts.
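If you only care about the most frequent tokens, Counter's most_common() method (standard library, nothing beyond the example above) returns the tally already sorted:

```python
from collections import Counter

text = "token token another token and another"
token_counts = Counter(text.split())
# most_common(n) returns the n highest-frequency (token, count) pairs
print(token_counts.most_common(2))  # [('token', 3), ('another', 2)]
```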

Using the nltk library for natural language tokenization

import nltk
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
text = "NLTK handles contractions, like don't and won't!"
tokens = nltk.word_tokenize(text)
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")--OUTPUT--Token count: 11
Tokens: ['NLTK', 'handles', 'contractions', ',', 'like', 'do', "n't", 'and', 'wo', "n't", '!']

For more advanced needs, the Natural Language Toolkit (NLTK) is a go-to library. Its word_tokenize() function is specifically designed for linguistic tasks and understands the structure of human language far better than simple string methods.

  • It intelligently splits text based on grammatical rules, not just whitespace or punctuation.
  • Notice how it correctly breaks down contractions like don't into do and n't. This level of detail is essential for many NLP applications, including sentiment analysis and machine translation.

Advanced tokenization approaches

While tools like NLTK are great for general use, you'll need more advanced approaches for multilingual text, subword tokenization, and domain-specific language.

Working with multiple languages using spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
text = "SpaCy tokenizes text intelligently based on language rules."
doc = nlp(text)
tokens = [token.text for token in doc]
print(f"Token count: {len(tokens)}")
print(f"Tokens: {tokens}")--OUTPUT--Token count: 9
Tokens: ['SpaCy', 'tokenizes', 'text', 'intelligently', 'based', 'on', 'language', 'rules', '.']

When you're working with specific languages, spaCy is an excellent choice. It uses pre-trained models—like the English one loaded here with spacy.load("en_core_web_sm")—to apply sophisticated, language-specific rules. This approach is far more advanced than just splitting text by spaces or punctuation.

Processing text with the nlp object returns a doc container, which is much more than a simple list of words.

  • Each item in the doc is a token object containing rich linguistic information, not just a string.
  • You can access the plain text of each token using token.text, as shown in the list comprehension.

Subword tokenization with the transformers library

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "BERT tokenizers use subword units."
tokens = tokenizer.tokenize(text)
print(f"Subword token count: {len(tokens)}")
print(f"Subword tokens: {tokens}")--OUTPUT--Subword token count: 8
Subword tokens: ['bert', 'token', '##izers', 'use', 'sub', '##word', 'units', '.']

For modern LLMs, you'll almost always work with subword tokenization using a library like transformers. The AutoTokenizer class fetches the correct tokenizer for a specific model—in this case, bert-base-uncased. This method is powerful because it can handle words it has never seen before.

  • It breaks down complex or rare words into smaller, recognizable "subword" units. Notice how tokenizers becomes token and ##izers.
  • The ## prefix indicates that the piece is part of the preceding word, not a new one. This helps the model understand vocabulary more efficiently.
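One caveat when estimating context usage: tokenize() omits the special tokens a model adds around its input. The tokenizer's encode() method returns the full sequence of input IDs, including BERT's [CLS] and [SEP]. A sketch using the same bert-base-uncased tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "BERT tokenizers use subword units."

subwords = tokenizer.tokenize(text)  # no special tokens
input_ids = tokenizer.encode(text)   # adds [CLS] ... [SEP]
print(f"Subword tokens: {len(subwords)}")
print(f"Model input IDs: {len(input_ids)}")
```

For context-limit checks, the encode() length is the one to compare against the model's maximum.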

Building a custom tokenizer for specialized text

def custom_tokenizer(text, delimiters=[' ', ',', '.', ':', ';', '\n']):
    tokens = []
    current_token = ""
    for char in text:
        if char in delimiters:
            if current_token:
                tokens.append(current_token)
                current_token = ""
        else:
            current_token += char
    if current_token:
        tokens.append(current_token)
    return tokens

text = "Custom:tokenizer,for specialized;needs"
print(f"Tokens: {custom_tokenizer(text)}")--OUTPUT--Tokens: ['Custom', 'tokenizer', 'for', 'specialized', 'needs']

Sometimes, pre-built libraries don't cut it, especially with unique data formats like log files or custom syntax. In these cases, you can build your own tokenizer. This function gives you complete control over how text is split into tokens.

  • The custom_tokenizer function works by iterating through your text one character at a time.
  • It builds up a token until it hits a character in the delimiters list, like a space or a comma, at which point it saves the completed token.
  • This method is perfect for handling specialized text where standard rules don't apply.
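For simple delimiter sets like this one, the character loop can also be collapsed into a single re.split() call with a character class — an equivalent sketch:

```python
import re

text = "Custom:tokenizer,for specialized;needs"
# Split on any run of delimiter characters, then drop empty strings
tokens = [t for t in re.split(r'[ ,.:;\n]+', text) if t]
print(f"Tokens: {tokens}")  # ['Custom', 'tokenizer', 'for', 'specialized', 'needs']
```

The hand-rolled loop earns its keep when the rules get stateful — for example, keeping delimiters inside quoted strings intact — which a flat character class can't express.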

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it, complete with databases, APIs, and deployment.

The tokenization techniques in this article are powerful building blocks. With Replit Agent, you can turn them into production-ready applications:

  • Build an API cost calculator that uses a model's specific tokenizer to estimate expenses before sending a prompt.
  • Create a text analysis dashboard that visualizes token frequency and vocabulary density using collections.Counter.
  • Deploy a multilingual content processor that leverages spaCy to correctly tokenize and handle text across different languages.

You can bring these ideas to life by describing them in plain English. Try Replit Agent and watch it write, test, and deploy your code automatically.

Common errors and challenges

Even with powerful tools, you can run into subtle issues that skew your token counts, but they're easy to fix once you spot them.

  • Removing punctuation when using split(): A simple split() call often leaves punctuation attached to words, which can create incorrect tokens like 'Python.'. You can get a cleaner result by first removing punctuation from your string before you split it by whitespace.
  • Correctly handling contractions with re.findall(): While powerful, a poorly written regular expression can mishandle contractions, splitting don't into don and t. Your pattern needs to be smart enough to recognize common contractions as single units to preserve their meaning.
  • Case-insensitive counting with Counter: The Counter object is case-sensitive, meaning it treats Token and token as two separate items. To get an accurate frequency count, convert your entire text to lowercase with the lower() method before tokenizing.

Removing punctuation when using split()

Using split() for tokenization is simple, but it often leads to inaccurate counts. Because the method only separates text by whitespace, punctuation gets stuck to the end of words. This creates incorrect tokens like 'world!'. Check out the example below.

text = "Hello, world! This is a test."
tokens = text.split()
print(f"Tokens: {tokens}")

The split() method separates the string only by spaces, leaving punctuation attached to words like 'Hello,' and 'world!'. This incorrectly inflates your token count. See how a small adjustment can fix this in the following code.

text = "Hello, world! This is a test."
tokens = [word.strip(',.!?:;') for word in text.split()]
print(f"Tokens: {tokens}")

The solution uses a list comprehension to clean each word after splitting the text. By applying word.strip(',.!?:;') to every item from text.split(), you remove any leading or trailing punctuation specified in the string. This simple step turns messy tokens like 'world!' into clean ones like 'world', giving you a more accurate count. It’s a quick way to refine results when you're using a basic split() for tokenization.
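An alternative is to delete punctuation everywhere in the string before splitting, using the standard library's string.punctuation with str.translate:

```python
import string

text = "Hello, world! This is a test."
# Build a translation table that deletes every ASCII punctuation character
table = str.maketrans('', '', string.punctuation)
tokens = text.translate(table).split()
print(f"Tokens: {tokens}")  # ['Hello', 'world', 'This', 'is', 'a', 'test']
```

Note that unlike strip(), this also removes punctuation inside words (it would turn don't into dont), so prefer the strip() approach when apostrophes carry meaning.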

Correctly handling contractions with re.findall()

While re.findall() is powerful, a simple pattern like r'\w+' can't handle contractions correctly. It often splits words like don't into two separate tokens, which misrepresents the word's meaning. The following code demonstrates exactly how this happens.

import re
text = "Don't forget contractions like it's and can't"
tokens = re.findall(r'\w+', text)
print(f"Tokens: {tokens}")

The r'\w+' pattern only matches word characters, treating the apostrophe as a break. This incorrectly splits Don't into Don and t. The following code demonstrates how a more specific pattern solves this.

import re
text = "Don't forget contractions like it's and can't"
tokens = re.findall(r"\b\w+(?:'\w+)?", text)
print(f"Tokens: {tokens}")

The solution uses a more advanced regular expression, r"\b\w+(?:'\w+)?", to correctly identify contractions. This pattern finds a word and then optionally matches an apostrophe followed by more letters. This ensures that words like Don't and it's aren't split apart and are treated as single tokens. You'll want to use this approach when the meaning of contractions is important, like in sentiment analysis, to maintain the text's original intent.

Case-insensitive counting with Counter

The Counter object is case-sensitive, which can be a problem when you're counting word frequencies. It treats Python and python as distinct words, which isn't always what you want. See how this distorts the results in the code below.

from collections import Counter
text = "Python is great. PYTHON is powerful. python is easy."
token_counts = Counter(text.split())
print(f"Token frequencies: {dict(token_counts)}")

The code treats 'Python' and 'python' as distinct tokens, which fragments the frequency count for what is essentially the same word. This gives you a misleading analysis. The following example shows how to get an accurate count.

from collections import Counter
text = "Python is great. PYTHON is powerful. python is easy."
token_counts = Counter(text.lower().split())
print(f"Token frequencies: {dict(token_counts)}")

The solution is to convert the entire string to lowercase before splitting it. By calling the lower() method on your text, you ensure that words like Python, PYTHON, and python are all treated as the same token. This way, you'll get an accurate frequency count. This is crucial for tasks like sentiment analysis or topic modeling, where the word itself matters more than its capitalization.

Real-world applications

With those common pitfalls addressed, you can apply these tokenization skills to practical applications like sentiment analysis and search engine development.

Analyzing sentiment in customer reviews using split()

Even a basic split() method is useful for sentiment analysis, allowing you to quickly classify text by counting positive and negative keywords.

reviews = ["This product is excellent!", "I really dislike this service", "Okay experience overall"]
positive_words = ["good", "great", "excellent", "amazing", "like", "love"]
negative_words = ["bad", "poor", "terrible", "dislike", "hate", "disappointing"]

for review in reviews:
   tokens = review.lower().split()
   pos_count = sum(1 for token in tokens if token in positive_words)
   neg_count = sum(1 for token in tokens if token in negative_words)
   sentiment = "positive" if pos_count > neg_count else "negative" if neg_count > pos_count else "neutral"
   print(f"Review: '{review}' - Sentiment: {sentiment}")

This script classifies text by checking for specific keywords. It loops through each review, standardizes the text with lower(), and tokenizes it with split(). The core logic relies on simple word matching.

  • It tallies positive and negative words by comparing each token against predefined lists.
  • A conditional expression then compares the counts to assign a positive, negative, or neutral label.

This method offers a straightforward way to gauge sentiment by scoring text based on its vocabulary.

Building a simple search engine with TF-IDF tokenization

By combining tokenization with a scoring method called Term Frequency-Inverse Document Frequency (TF-IDF), you can build a simple search engine that ranks documents by relevance.

from collections import Counter
import math

documents = [
   "Python is a programming language",
   "Python is used for web development",
   "Natural language processing uses Python"
]

query = "Python programming"
query_tokens = query.lower().split()

for i, doc in enumerate(documents):
   doc_tokens = doc.lower().split()
   score = 0
   for token in query_tokens:
       if token in doc_tokens:
           tf = doc_tokens.count(token) / len(doc_tokens)
           df = sum(1 for d in documents if token in d.lower().split())
           idf = math.log(len(documents) / df)
           score += tf * idf
   print(f"Doc {i} score: {score:.4f} - '{doc}'")

This script ranks documents by relevance to a search query. It scores each document by analyzing how the words from your query appear within it and across the entire collection.

  • First, it measures how often a query word appears in a single document. This signals the word's local importance.
  • It then checks how many documents contain that same word. A term that’s in every document gets a lower value, which prevents common words from skewing the results.

A document’s final score balances these two factors, effectively surfacing documents where query terms are both frequent and distinctive.
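In practice you'd usually reach for scikit-learn's TfidfVectorizer, which implements the same idea with smoothing and normalization (so the absolute scores differ from the manual version above). A sketch, assuming scikit-learn is installed:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Python is a programming language",
    "Python is used for web development",
    "Natural language processing uses Python",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform(["Python programming"])

# Relevance as the dot product of each document vector with the query
scores = (doc_matrix @ query_vector.T).toarray().ravel()
for i, score in enumerate(scores):
    print(f"Doc {i} score: {score:.4f}")
```

The ranking matches the manual version: the first document scores highest because it contains both query terms, one of them distinctive.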

Get started with Replit

Turn your knowledge into a real tool with Replit Agent. Try prompts like, “Build a token cost calculator for OpenAI’s API” or “Create a dashboard that visualizes token frequency from a text file.”

Describe your idea, and the agent will write the code, test for errors, and deploy your application. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
