How to count words in Python
Learn how to count words in Python. This guide covers various methods, tips, real-world applications, and how to debug common errors.

Word count is a frequent task in text analysis. Python’s split() method and other tools make it simple to handle this with just a few lines of code.
In this article, you'll explore several techniques for word counting. You'll also find practical tips, see real-world applications, and get advice for debugging common issues you might encounter.
Basic word counting with split()
text = "This is a simple example of word counting in Python."
word_count = len(text.split())
print(f"Word count: {word_count}")

Output:
Word count: 10
The split() method is the workhorse in this example. By default, it breaks a string into a list of words using any whitespace as the delimiter. This makes it effective for handling single spaces, multiple spaces, or even tabs between words without extra logic.
After split() creates the list, the len() function returns the number of items it contains. This combination is highly efficient for getting a quick word count from straightforward text where complex punctuation isn't a factor.
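A quick check of that whitespace handling (the messy string here is just for illustration):

```python
# split() with no arguments treats any run of spaces, tabs, or newlines
# as a single delimiter, so irregular spacing doesn't inflate the count
messy = "one  two\tthree\nfour"
print(f"Word count: {len(messy.split())}")  # Word count: 4
```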
Basic word counting techniques
Beyond the default behavior of split(), you can refine your word counting by using custom delimiters, tracking word frequencies, or applying flexible pattern matching.
Using custom delimiters with split()
text = "This is a simple,example of word-counting in Python."
word_count = len(text.replace(",", " ").replace("-", " ").split())
print(f"Word count: {word_count}")

Output:
Word count: 10
When text includes punctuation like commas, the default split() method isn't enough. This example demonstrates a common technique for handling such cases by pre-processing the string before counting.
- The replace() method is used to swap unwanted characters with a space.
- You can chain multiple replace() calls to handle various delimiters, like the comma and hyphen in the code.
After replacing the punctuation, split() can then operate on the cleaned string to accurately count the words.
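If you have many delimiters, chaining replace() calls gets verbose. One alternative (a sketch using the same sample sentence) is str.translate with a mapping table:

```python
text = "This is a simple,example of word-counting in Python."
# Map each delimiter to a space in one pass instead of chained replace() calls
table = str.maketrans({",": " ", "-": " "})
word_count = len(text.translate(table).split())
print(f"Word count: {word_count}")  # Word count: 10
```

This scales better as the set of delimiters grows, since you only extend the mapping instead of adding another method call.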
Counting word frequency with collections.Counter
from collections import Counter
text = "apple banana apple orange banana apple"
word_freq = Counter(text.split())
print(word_freq)
print(f"Most common word: {word_freq.most_common(1)[0][0]}")

Output:
Counter({'apple': 3, 'banana': 2, 'orange': 1})
Most common word: apple
When you need to know not just the number of words but also how often each appears, the Counter class from the collections module is your best friend. It takes an iterable—like the list of words from split()—and returns a dictionary-like object that maps each word to its frequency.
- The Counter object gives you a quick overview of all word counts at once.
- For more specific analysis, you can use its built-in methods. The most_common() method is particularly useful for finding the most frequent words in your text.
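For example, passing a number to most_common() returns the top entries as (word, count) pairs:

```python
from collections import Counter

word_freq = Counter("apple banana apple orange banana apple".split())
# Top two words by frequency, most frequent first
print(word_freq.most_common(2))  # [('apple', 3), ('banana', 2)]
```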
Using regular expressions for flexible word counting
import re
text = "Hello, world! This is a test. It has 4 sentences and some punctuation."
words = re.findall(r'\b[a-zA-Z]+\b', text)
print(f"Word count: {len(words)}")
print(words[:5])  # First 5 words

Output:
Word count: 12
['Hello', 'world', 'This', 'is', 'a']
For text with mixed punctuation and numbers, regular expressions offer a more precise way to count words. The re.findall() function finds all substrings that match a specific pattern, giving you fine-grained control over what you define as a "word."
- The pattern r'\b[a-zA-Z]+\b' looks for sequences of one or more letters.
- The \b represents a word boundary, which ensures that only whole words are matched. Because the character class contains only letters, numbers and punctuation are excluded from the count.
This approach is more robust than simple splitting when your text isn't perfectly clean.
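The pattern itself is adjustable to your definition of a word. This variation (an illustrative tweak, not part of the example above) allows an apostrophe-joined suffix so contractions count as single words:

```python
import re

text = "Don't stop; it's a test."
# Optionally match an apostrophe suffix so "Don't" stays one word
words = re.findall(r"[a-zA-Z]+(?:'[a-zA-Z]+)?", text)
print(f"Word count: {len(words)}")  # Word count: 5
```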
Advanced word counting approaches
Building on the basics, you can tackle more complex text analysis by using specialized libraries, creating custom classes, or optimizing for memory with generators.
Using NLTK for tokenization and word counting
import nltk
nltk.download('punkt', quiet=True)
text = "Hello, world! This is a test."
tokens = nltk.word_tokenize(text)
print(f"Word count: {len(tokens)}")
print(tokens)

Output:
Word count: 9
['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
For more advanced text analysis, you can use the Natural Language Toolkit (NLTK). The library’s word_tokenize() function splits text into a list of tokens, which are individual units of text like words or punctuation.
- Unlike basic splitting, word_tokenize() separates punctuation from words, treating each as a distinct token.
- This level of detail is essential for many natural language processing tasks where punctuation itself carries important meaning.
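If you want a count of the words alone, a common follow-up is to filter out non-alphabetic tokens. This sketch starts from the token list word_tokenize produced above, so it runs without NLTK installed:

```python
# Token list as produced by nltk.word_tokenize in the example above
tokens = ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
words = [t for t in tokens if t.isalpha()]
print(f"Word count: {len(words)}")  # Word count: 6
```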
Creating a custom word counter class
class WordCounter:
    def __init__(self, text):
        self.text = text
        self.words = text.lower().split()

    def count(self):
        return len(self.words)

    def unique_count(self):
        return len(set(self.words))

counter = WordCounter("apple banana apple orange")
print(f"Total words: {counter.count()}, Unique words: {counter.unique_count()}")

Output:
Total words: 4, Unique words: 3
For more control, you can build a custom WordCounter class. This approach packages your logic into a reusable object, making your code cleaner and more organized. When an instance is created, the __init__ method automatically processes the input text by converting it to lowercase and splitting it into words.
- The count() method simply returns the total number of words.
- The unique_count() method calculates the number of distinct words by converting the list to a set, which automatically handles duplicate entries for you.
Memory-efficient word counting with generators
def count_words_from_file(filename):
    with open(filename, 'r') as file:
        for line in file:
            yield len(line.split())

# Example with string lines instead of a file
total = sum(len(line.split()) for line in ["This is line one", "This is line two"])
print(f"Total word count: {total}")

Output:
Total word count: 8
When you're dealing with large files, loading the entire text into memory isn't practical. Generators provide a memory-efficient alternative. The function uses the yield keyword to process the file one line at a time instead of all at once.
- It reads a single line, counts the words using split(), and yields the count.
- This process repeats for every line, ensuring only a small amount of data is in memory at any moment.
You can then aggregate these individual counts, for example by using the sum() function, to get a final total.
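Putting the two together, you can feed the generator straight into sum(). The temporary file below is created only so the sketch is runnable end to end:

```python
import os
import tempfile

def count_words_from_file(filename):
    # Yield one per-line word count at a time instead of loading the file
    with open(filename, 'r') as file:
        for line in file:
            yield len(line.split())

# Write a small demo file so the example can run anywhere
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write("This is line one\nThis is line two\n")
    path = tmp.name

total = sum(count_words_from_file(path))
print(f"Total word count: {total}")  # Total word count: 8
os.remove(path)
```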
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the word counting techniques we've explored, Replit Agent can turn them into production-ready tools:
- Build an SEO tool that calculates keyword density from a URL or text input.
- Create a text analysis dashboard that displays word count, unique words, and the most frequent terms.
- Deploy a readability score calculator that processes text to determine its complexity.
Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all in your browser.
Common errors and challenges
When counting words in Python, you might run into a few common pitfalls, but they're all straightforward to solve.
Handling None values and empty strings when using split()
A frequent error occurs when you try to call split() on a variable that holds a None value, which raises an AttributeError. An empty string won't cause an error, but it will produce an empty list, which might not be the behavior you expect.
To prevent this, it's good practice to add a check before processing your text. A simple conditional statement can verify that your input is a non-empty string before you attempt to split it.
Fixing case sensitivity issues in word frequency counting
When you're counting word frequencies, case sensitivity can easily skew your results. For example, "Python" and "python" will be counted as two distinct words, even though you'd likely want to treat them as the same.
The fix is simple: convert the entire text to a single case before you start counting. Using the lower() method to make all text lowercase is the standard approach and ensures your frequency counts are accurate.
Memory-efficient word counting for large files
If you try to read a massive file into a single string, you risk running out of memory and crashing your program. This is a common challenge when working with large datasets or log files.
Instead of loading everything at once, you should process the file line by line or in manageable chunks. This approach keeps memory usage low and stable, no matter how large the file is.
Handling None values and empty strings when using split()
Trying to process text that might be None is a classic stumbling block. If you call split() on a None value, your program will stop with an AttributeError. The code below shows exactly what this looks like in a real-world scenario.
def count_words(text):
    return len(text.split())

# This will raise an AttributeError
text_sample = None
word_count = count_words(text_sample)
print(f"Word count: {word_count}")
The count_words function directly calls split() on its input. When that input is None, the program crashes because the None type has no split() method. The following code demonstrates how to handle this gracefully.
def count_words(text):
    if text is None or text == "":
        return 0
    return len(text.split())

text_sample = None
word_count = count_words(text_sample)
print(f"Word count: {word_count}")  # Outputs: Word count: 0
The solution is to add a simple guard clause. The updated count_words function first checks if the text input is None or an empty string. If either is true, it returns 0 right away, sidestepping the error. This defensive programming is a good habit, especially when dealing with data from user input or API calls, since you can't always be sure what you'll get.
Fixing case sensitivity issues in word frequency counting
Your word frequency counts can be easily skewed by case sensitivity. A tool like Counter will see "Apple" and "apple" as two separate words, which isn't what you'd expect. The code below shows this problem in action, creating separate counts for each variation.
from collections import Counter
text = "Apple apple APPLE Orange orange"
word_freq = Counter(text.split())
print(word_freq) # Counts 'Apple', 'apple', and 'APPLE' as different words
The Counter object is case-sensitive, so it registers "Apple", "apple", and "APPLE" as separate items. This splits the frequency count across multiple entries instead of consolidating them. See how to resolve this in the code below.
from collections import Counter
text = "Apple apple APPLE Orange orange"
word_freq = Counter(text.lower().split())
print(word_freq) # Now counts all variants of 'apple' as the same word
The solution is to convert the text to a single case before counting. By calling the lower() method on the string first, you normalize the text. This ensures that Counter treats all variations like "Apple" and "apple" as the same word. It's a simple preprocessing step that's crucial whenever you're analyzing word frequencies, as it prevents your counts from being split across different capitalizations and gives you a more accurate result.
Memory-efficient word counting for large files
When you're working with large files, reading the entire content into memory at once can crash your program. This approach, often done with file.read(), consumes significant RAM and becomes a major bottleneck. The code below demonstrates this exact problem.
def count_words_in_large_file(filename):
    with open(filename, 'r') as file:
        content = file.read()  # Loads entire file into memory
    return len(content.split())
The file.read() function loads the entire file into one string, which can exhaust your system's memory and crash the program. The code below demonstrates a more efficient way to process the file without this risk.
def count_words_in_large_file(filename):
    total = 0
    with open(filename, 'r') as file:
        for line in file:  # Processes one line at a time
            total += len(line.split())
    return total
The solution iterates through the file, processing it one line at a time. This approach is memory-efficient because it never holds the entire file's content in memory at once. Inside the loop, it uses len(line.split()) to count words in the current line and adds them to a running total. This technique is essential when you're working with large datasets or log files that are too big to handle in a single operation.
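The line-by-line loop assumes individual lines are a reasonable length. For files with very long lines (minified JSON, some log formats), reading fixed-size chunks works too. The tricky part, handled below, is a word that straddles a chunk boundary; this is a sketch of that idea, not part of the original example:

```python
def count_words_in_chunks(filename, chunk_size=8192):
    total = 0
    leftover = ""  # Partial word carried over from the previous chunk
    with open(filename, 'r') as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            words = chunk.split()
            if not chunk[-1].isspace() and words:
                # The chunk may have cut the last word in half; save it
                leftover = words.pop()
            else:
                leftover = ""
            total += len(words)
    return total + (1 if leftover else 0)
```

Memory use stays bounded by chunk_size, regardless of how long any single line is.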
Real-world applications
With these techniques for handling errors and complex text, you can build powerful tools for real-world text analysis.
One practical application is analyzing text readability. Formulas like the Flesch-Kincaid test measure how easy a document is to understand by using metrics like word count and sentence count. You can get the word count using split() and count sentences by splitting the text on punctuation like periods and question marks. By combining these counts, you can calculate a readability score, which is useful for everything from checking if a blog post is accessible to ensuring legal documents are clear.
You can also use word counting to measure document similarity. This is helpful for tasks like detecting plagiarism or grouping related articles. The process involves converting each document’s text into a set of unique words, which you saw earlier with the custom WordCounter class. Once you have two sets of words, you can use set operations to compare them. For example, the Jaccard index—a common similarity metric—is calculated by dividing the number of shared words (the intersection) by the total number of unique words across both documents (the union). The result is a score that tells you how much vocabulary the two texts have in common.
Analyzing text readability with split() and counting
Calculating the average words per sentence is a straightforward way to approximate a text's reading level.
text = "This is a simple example. It has three short sentences. Reading level varies by sentence length."
words = len(text.split())
sentences = len([s for s in text.replace('!', '.').replace('?', '.').split('.') if s.strip()])
avg_words_per_sentence = words / sentences
print(f"Words: {words}, Sentences: {sentences}")
print(f"Average words per sentence: {avg_words_per_sentence:.1f}")
print(f"Reading level: {'Easy' if avg_words_per_sentence < 10 else 'Complex'}")
This snippet breaks down text analysis into a few key steps to determine readability. It starts with a simple word count using len(text.split()). The real work happens when counting sentences.
- It first normalizes the text by replacing all ! and ? characters with a period using replace().
- A list comprehension then splits the string by periods and filters out any empty results with s.strip(), ensuring an accurate sentence count.
The final average is used to assign a basic 'Easy' or 'Complex' reading level.
Finding document similarity with set operations on words
By converting each document into a set of words, you can easily calculate their similarity and identify the exact words they share.
def jaccard_similarity(doc1, doc2):
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    intersection = words1.intersection(words2)
    union = words1.union(words2)
    return len(intersection) / len(union), intersection
text1 = "Python is a versatile programming language"
text2 = "Python is used for web development and data science"
similarity, common_words = jaccard_similarity(text1, text2)
print(f"Similarity score: {similarity:.2f}")
print(f"Shared words: {common_words}")
The jaccard_similarity function quantifies how similar two documents are based on their shared vocabulary. It starts by converting each document into a set of unique, lowercase words, which makes the comparison case-insensitive and ignores duplicate words within the same text.
- The intersection() method identifies the words that are common to both sets.
- The union() method creates a single set containing all unique words from both documents combined.
The final similarity score is the ratio of the shared word count to the total unique word count. A score closer to 1.0 indicates a higher degree of similarity.
Get started with Replit
Now, turn these concepts into a real application. Tell Replit Agent to “build a readability score calculator” or “create a keyword density checker that analyzes text from a URL.”
The agent writes the code, tests for errors, and deploys your app automatically. All you need is an idea. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.