How to count words in Python

Learn how to count words in Python. This guide covers various methods, tips, real-world uses, and common errors to help you master the task.

Published on: Tue, Mar 3, 2026
Updated on: Wed, Apr 1, 2026
The Replit Team

Word count in Python is a fundamental task for text analysis and data processing. Python's built-in functions and string methods provide straightforward ways to count words with simple syntax.

In this article, you'll explore several techniques to count words, from the simple split() method to more advanced approaches. You'll get practical tips, see real-world applications, and receive debugging advice.

Basic word counting with split()

text = "This is a simple example of word counting in Python."
word_count = len(text.split())
print(f"Word count: {word_count}")

Output:

Word count: 10

The split() method does the heavy lifting here. When you call it without any arguments, it intelligently breaks the string into a list of words, using any whitespace as the separator. This is great because it automatically handles single spaces, tabs, and newlines, making it a reliable first step for simple texts.

Once the string becomes a list of words, the len() function simply counts the number of items in that list. This combination gives you a quick and accurate word count, especially for text that doesn't have complex punctuation to consider.
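To see that whitespace handling in action, here's a small sketch (the `messy` string is an illustrative example) showing how split() with no arguments collapses tabs, repeated spaces, and newlines:

```python
# split() with no arguments treats any run of whitespace as a single separator
messy = "one\ttwo   three\nfour"
print(messy.split())       # ['one', 'two', 'three', 'four']
print(len(messy.split()))  # 4
```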

Basic word counting techniques

While split() is great for simple cases, you can gain more precision by defining custom delimiters, counting word frequencies, or using powerful regular expressions.

Using custom delimiters with split()

text = "This is a simple,example of word-counting in Python."
word_count = len(text.replace(",", " ").replace("-", " ").split())
print(f"Word count: {word_count}")

Output:

Word count: 10

Real-world text isn't always clean. You'll often find punctuation that can throw off a simple word count. This example uses method chaining to handle specific delimiters before splitting the string.

  • First, replace(",", " ") swaps every comma with a space.
  • Next, replace("-", " ") does the same for hyphens, effectively separating words like word-counting.
  • Only then does split() break the sanitized string into a list for an accurate count.

Counting word frequency with collections.Counter

from collections import Counter

text = "apple banana apple orange banana apple"
word_freq = Counter(text.split())
print(word_freq)
print(f"Most common word: {word_freq.most_common(1)[0][0]}")

Output:

Counter({'apple': 3, 'banana': 2, 'orange': 1})
Most common word: apple

When you need more than just a total count, collections.Counter is your go-to tool. It takes an iterable—in this case, the list from text.split()—and creates a dictionary-like object that maps each unique word to its frequency. This gives you a quick summary of your text's vocabulary.

  • The most_common() method is especially useful. It returns a list of tuples, ordered from the most frequent word to the least, making it simple to identify top keywords.
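As a quick sketch of what most_common() returns, reusing the fruit string from the example above:

```python
from collections import Counter

word_freq = Counter("apple banana apple orange banana apple".split())

# most_common(n) returns the n highest-frequency (word, count) pairs,
# ordered from most to least frequent
print(word_freq.most_common(2))  # [('apple', 3), ('banana', 2)]
```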

Using regular expressions for flexible word counting

import re

text = "Hello, world! This is a test. It has 4 sentences and some punctuation."
words = re.findall(r'\b[a-zA-Z]+\b', text)
print(f"Word count: {len(words)}")
print(words[:5]) # First 5 words

Output:

Word count: 12
['Hello', 'world', 'This', 'is', 'a']

When you need surgical precision, regular expressions are your best tool. The re.findall() function extracts all text matching a pattern, giving you fine-grained control. The pattern r'\b[a-zA-Z]+\b' specifically targets alphabetic words.

  • \b represents a word boundary. This ensures you're matching whole words, not just parts of them.
  • [a-zA-Z]+ matches one or more letters, effectively ignoring numbers and punctuation.

This method returns a clean list of words, which len() can then easily count.
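If you want digits counted as words too, one option is to swap the pattern for r'\w+'. A sketch of that variant (note that \w also matches underscores, which may or may not fit your definition of a word):

```python
import re

text = "Hello, world! This is a test. It has 4 sentences and some punctuation."

# \w+ matches runs of letters, digits, and underscores, so "4" now counts
words = re.findall(r'\w+', text)
print(f"Word count: {len(words)}")  # Word count: 13
```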

Advanced word counting approaches

Building on these fundamentals, you can tackle more complex text with specialized libraries, custom classes, and memory-efficient generators for greater control and scale.

Using NLTK for tokenization and word counting

import nltk
nltk.download('punkt', quiet=True)
text = "Hello, world! This is a test."
tokens = nltk.word_tokenize(text)
print(f"Word count: {len(tokens)}")
print(tokens)

Output:

Word count: 9
['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']

For more advanced text analysis, you can use the Natural Language Toolkit (NLTK), a powerful library for natural language processing. Its word_tokenize() function breaks text into a list of tokens, which includes both words and punctuation. This process, called tokenization, is a cornerstone of NLP.

  • Unlike the basic split() method, word_tokenize() treats punctuation marks like commas and periods as separate items in the list.
  • This gives you a more granular view of the text—it's an essential first step for tasks like sentiment analysis or part-of-speech tagging.
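If you want only the words from a token list, str.isalpha() can filter out the punctuation tokens. A minimal sketch, hard-coding the token list from the output above so it runs without NLTK installed:

```python
# Token list as produced by nltk.word_tokenize("Hello, world! This is a test.")
tokens = ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']

# Keep only purely alphabetic tokens, dropping punctuation marks
words = [t for t in tokens if t.isalpha()]
print(f"Word count: {len(words)}")  # Word count: 6
```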

Creating a custom word counter class

class WordCounter:
    def __init__(self, text):
        self.text = text
        self.words = text.lower().split()

    def count(self):
        return len(self.words)

    def unique_count(self):
        return len(set(self.words))

counter = WordCounter("apple banana apple orange")
print(f"Total words: {counter.count()}, Unique words: {counter.unique_count()}")

Output:

Total words: 4, Unique words: 3

Creating a custom class like WordCounter bundles related functionality into a single, reusable object. This approach keeps your code organized and easy to manage. When you create a WordCounter instance, the __init__ method immediately processes the input text by converting it to lowercase and splitting it into words.

  • The count() method gives you the total number of words.
  • The unique_count() method cleverly uses a set to find the number of unique words, since sets automatically discard duplicates.

Memory-efficient word counting with generators

def count_words_from_file(filename):
    with open(filename, 'r') as file:
        for line in file:
            yield len(line.split())

# Example with string lines instead of a file
total = sum(len(line.split()) for line in ["This is line one", "This is line two"])
print(f"Total word count: {total}")

Output:

Total word count: 8

When you're working with large files, loading everything into memory at once can be inefficient. Generators offer a smarter solution. The count_words_from_file function uses the yield keyword to process the file one line at a time, which is far more memory-friendly.

  • Instead of returning a single, large list of words, the function reads a line, counts its words, and yields the result.
  • It then moves to the next line without storing the previous one in memory.

This approach is ideal for analyzing massive text files without overwhelming your system.
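To actually consume the generator, you can pass it straight to sum(). A self-contained sketch (the sample.txt filename is illustrative; the code writes it first so the example runs end to end):

```python
def count_words_from_file(filename):
    with open(filename, 'r') as file:
        for line in file:
            yield len(line.split())

# Write a small sample file so the example is self-contained
with open("sample.txt", "w") as f:
    f.write("This is line one\nThis is line two\n")

# sum() pulls one per-line count at a time from the generator,
# so the whole file is never held in memory
total = sum(count_words_from_file("sample.txt"))
print(f"Total word count: {total}")  # Total word count: 8
```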

Move faster with Replit

Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. With Agent 4, you can move from piecing together individual techniques to building complete, working applications.

Instead of just practicing with split() or Counter, you can describe the final tool you want, and Agent will build it:

  • A keyword density tool that analyzes text and calculates the frequency of specific terms for SEO analysis.
  • A text-cleaning utility that uses regular expressions to strip all punctuation and numbers from a document.
  • A memory-efficient log summarizer that processes large files line-by-line to count error types.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Even with Python's simple tools, you might run into a few common pitfalls when counting words, but they're all easy to fix.

  • Handling empty or non-existent text: Calling split() on a None value will crash your script with an AttributeError. Empty strings are safer than they look: "".split() with no arguments returns an empty list, but splitting on an explicit delimiter, such as "".split(","), returns a list containing one empty string, [''], giving you a misleading count of 1. It's always best to check your input before processing it.
  • Fixing case sensitivity: Word counters are case-sensitive by default, so "Python" and "python" are treated as two different words. To get an accurate frequency count, convert your text to a single case with the lower() method before you split and count. This ensures your counts aren't skewed by capitalization.
  • Managing large files: If you try to read a massive text file into memory all at once, you'll risk running out of RAM and crashing your program. The solution is to process the file line by line, a memory-efficient approach that works for files of any size.

Handling None values and empty strings when using split()

One of the most common tripwires is feeding your word counting function a None value instead of text. Since None doesn't have a split() method, your script will immediately crash with an AttributeError. The following code demonstrates this exact scenario.

def count_words(text):
    return len(text.split())

# This will raise an AttributeError
text_sample = None
word_count = count_words(text_sample)
print(f"Word count: {word_count}")

The function passes the None value directly to text.split(). Since the None object doesn't have a split() method, Python raises an AttributeError. The following code demonstrates a simple check to avoid this crash.

def count_words(text):
    if text is None or text == "":
        return 0
    return len(text.split())

text_sample = None
word_count = count_words(text_sample)
print(f"Word count: {word_count}") # Outputs: Word count: 0

The fix is a simple guard clause. The condition if text is None or text == "" checks for invalid input before any processing happens. If the text is empty or doesn't exist, the function immediately returns 0, which prevents the AttributeError from ever occurring. You'll want to use this kind of check whenever you're handling text from unpredictable sources, like user input, API responses, or files, to make your code more robust.
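One possible refinement, sketched below: `if not text` covers both None and "" in a single condition, and whitespace-only strings come out as 0 anyway because split() with no arguments returns an empty list for them:

```python
def count_words(text):
    # "if not text" is falsy-value shorthand covering both None and ""
    if not text:
        return 0
    return len(text.split())

print(count_words(None))         # 0
print(count_words(""))           # 0
print(count_words("   "))        # 0 -- "   ".split() returns []
print(count_words("two words"))  # 2
```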

Fixing case sensitivity issues in word frequency counting

Case sensitivity can easily skew your word frequency counts. By default, Python treats Apple, apple, and APPLE as three distinct words, which isn't what you typically want. This leads to inaccurate results. The following code demonstrates this problem in action.

from collections import Counter
text = "Apple apple APPLE Orange orange"
word_freq = Counter(text.split())
print(word_freq) # Counts 'Apple', 'apple', and 'APPLE' as different words

The Counter object creates a separate entry for each unique string. Because Apple, apple, and APPLE are distinct, the frequency count gets split across three different keys. The following code demonstrates the simple correction.

from collections import Counter
text = "Apple apple APPLE Orange orange"
word_freq = Counter(text.lower().split())
print(word_freq) # Now counts all variants of 'apple' as the same word

The fix is to convert the text to a single case before counting. By calling the lower() method on the string first, you standardize all words. This ensures that Counter treats "Apple" and "apple" as the same word, giving you an accurate frequency count. You'll want to do this whenever you're analyzing text from user input or external files, where capitalization is often inconsistent and can skew your results.

Memory-efficient word counting for large files

When you're working with massive text files, reading the entire file into memory at once is a recipe for disaster. This common approach, using file.read(), can quickly exhaust your system's RAM and crash your program. The code below demonstrates this inefficient method.

def count_words_in_large_file(filename):
    with open(filename, 'r') as file:
        content = file.read()
        return len(content.split()) # Loads entire file into memory

This function reads the entire file into memory with file.read(), which is risky. A large file can easily overwhelm your system's resources, causing a crash. The corrected version below offers a more robust and memory-safe solution.

def count_words_in_large_file(filename):
    total = 0
    with open(filename, 'r') as file:
        for line in file:
            total += len(line.split())
    return total # Processes one line at a time

The fix is to process the file line by line. Instead of using file.read(), the code iterates directly over the file object. This reads one line into memory at a time, counts its words with line.split(), and adds the result to a running total. This memory-safe technique is essential for analyzing large text files, such as application logs or data exports, without risking a crash from memory exhaustion.

Real-world applications

Beyond just getting a total, these counting techniques unlock practical applications like analyzing text readability and measuring document similarity.

Analyzing text readability with split() and counting

Combining word and sentence counts allows you to calculate the average words per sentence, a simple metric for gauging a text's readability.

text = "This is a simple example. It has three short sentences. Reading level varies by sentence length."
words = len(text.split())
sentences = len([s for s in text.replace('!', '.').replace('?', '.').split('.') if s.strip()])
avg_words_per_sentence = words / sentences

print(f"Words: {words}, Sentences: {sentences}")
print(f"Average words per sentence: {avg_words_per_sentence:.1f}")
print(f"Reading level: {'Easy' if avg_words_per_sentence < 10 else 'Complex'}")

This snippet calculates the average sentence length to classify the text's complexity. It first gets a total word count using len(text.split()). The real work happens when counting sentences, where it's a bit more clever.

  • It standardizes punctuation by replacing ! and ? with . before splitting the text.
  • A list comprehension with s.strip() filters out any empty strings that result from the split, ensuring an accurate sentence count.

Finally, it divides the word count by the sentence count and uses a conditional expression to label the text as 'Easy' or 'Complex'.

Finding document similarity with set operations on words

You can also measure how similar two documents are by comparing their unique vocabularies using Python’s set operations.

def jaccard_similarity(doc1, doc2):
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())

    intersection = words1.intersection(words2)
    union = words1.union(words2)

    return len(intersection) / len(union), intersection

text1 = "Python is a versatile programming language"
text2 = "Python is used for web development and data science"

similarity, common_words = jaccard_similarity(text1, text2)
print(f"Similarity score: {similarity:.2f}")
print(f"Shared words: {common_words}")

This function leverages Python’s set data type to efficiently compare two texts. It first normalizes each document by converting it to lowercase and splitting it into a collection of unique words. This setup makes comparing the vocabularies straightforward.

  • The function is particularly useful because it returns a tuple.
  • You get a numeric similarity score and the specific set of shared words, providing both quantitative and qualitative insight in a single call.
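As a quick sanity check of the score's range, here's a condensed sketch of the same logic: identical vocabularies score 1.0 and completely disjoint ones score 0.0:

```python
def jaccard(doc1, doc2):
    words1 = set(doc1.lower().split())
    words2 = set(doc2.lower().split())
    # |intersection| divided by |union|
    return len(words1 & words2) / len(words1 | words2)

print(jaccard("python rocks", "Python rocks"))  # 1.0
print(jaccard("apple banana", "cherry date"))   # 0.0
```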

Get started with Replit

Put these techniques into practice by building a real tool with Replit Agent. Describe what you want, like “a keyword density tool for SEO” or “a readability score calculator that analyzes text from a URL.”

Replit Agent will write the code, test for errors, and help you deploy your app. Start building with Replit and turn your concept into a working application.
