How to read a docx file in Python

Learn how to read DOCX files in Python. This guide covers various methods, tips, real-world applications, and common error debugging.

Published on: Tue, Mar 17, 2026
Updated on: Tue, Mar 24, 2026
The Replit Team

You can read DOCX files in Python to automate document processing and data extraction. This is a valuable skill for developers who work with text-based data from Microsoft Word documents.

Here, you'll learn techniques to read and parse DOCX files. You'll find practical tips, explore real-world applications, and get advice to debug common issues you might face.

Using python-docx to read document text

from docx import Document

doc = Document("example.docx")
full_text = []
for para in doc.paragraphs:
    full_text.append(para.text)
text = '\n'.join(full_text)
print(text[:100])

Output:
This is a sample document. It contains some text that we want to extract using python-docx. This is just a

The python-docx library treats a Word document as a collection of objects. When you load a file using Document("example.docx"), you get a main document object. This object’s most useful attribute is .paragraphs, which contains a list of all the paragraph objects from the original file.

The code then iterates through this list. For each paragraph, it accesses the .text attribute to get the raw text content. Appending each paragraph's text to a list and then joining them with '\n' is a clean way to reconstruct the full document text as a single string, correctly maintaining paragraph breaks.

Basic approaches to DOCX parsing

Beyond the object-oriented approach of python-docx, you can also use other libraries or even Python's built-in tools for more direct text extraction.

Extracting text using docx2txt

import docx2txt

text = docx2txt.process("example.docx")
word_count = len(text.split())
print(f"Extracted {word_count} words from the document")
print(text[:50])

Output:
Extracted 25 words from the document
This is a sample document. It contains some text

The docx2txt library offers a more direct alternative for text extraction. Unlike python-docx, which parses the document into objects, the docx2txt.process() function simply reads the entire file and returns its text as a single string.

  • This method is very efficient if you just need the raw text content and don't need to work with the document's structure.
  • After extraction, you can use standard Python string methods, such as split(), to process the text for tasks like counting words.

Reading DOCX with docx2python

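from docx2python import docx2python

result = docx2python("example.docx")
text = result.text      # all document text as a single string
images = result.images  # dict mapping image file names to their bytes
# len(text) would count characters, so count non-empty lines instead
paragraphs = [line for line in text.splitlines() if line.strip()]
print(f"Document has {len(paragraphs)} paragraphs and {len(images)} images")

Output:
Document has 5 paragraphs and 2 images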

The docx2python library offers a more structured approach, especially when you need more than just text. The docx2python() function returns a result object that neatly organizes the document's contents.

  • You can access all the text as a single string via the result.text attribute.
  • It also extracts embedded media, like images, which you can find in result.images.

This makes it a powerful choice for processing documents that contain both text and other media elements.

Using zipfile to extract XML content

import zipfile
from xml.etree.ElementTree import XML

with zipfile.ZipFile("example.docx") as docx:
    content = docx.read("word/document.xml")
xml_content = XML(content)
print(f"XML namespace: {xml_content.tag.split('}')[0][1:]}")

Output:
XML namespace: http://schemas.openxmlformats.org/wordprocessingml/2006/main

Under the hood, a DOCX file is just a ZIP archive containing XML files. This means you can use Python's built-in zipfile module to access its contents directly. It’s a lower-level approach that gives you full control over the document's raw structure.

  • The code opens the DOCX file with zipfile.ZipFile and reads word/document.xml, which is where the main body text lives.
  • Once you have the raw XML, you can parse it with a library like xml.etree.ElementTree to programmatically navigate the document's nodes and text.
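As a rough sketch of that second step, the function below walks the XML and collects the text of each paragraph. It assumes the standard WordprocessingML layout, where paragraphs are w:p elements and their visible text sits in w:t elements. The names extract_paragraphs and build_sample_docx are made up for this illustration, and the sample archive contains only word/document.xml, so it is enough for the demo but is not a file Word itself would open.

```python
import io
import zipfile
from xml.etree import ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_paragraphs(docx_file):
    """Return the text of each <w:p> paragraph in a DOCX file.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's visible text is split across one or more <w:t> nodes
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return paragraphs


def build_sample_docx():
    """Build a tiny in-memory archive holding only word/document.xml (demo only)."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


sample_paragraphs = extract_paragraphs(build_sample_docx())
print(sample_paragraphs)
```

For a real file, you would pass its path instead, for example extract_paragraphs("example.docx"). Because the search covers the whole tree, paragraphs nested inside tables are picked up as well.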

Advanced techniques and optimizations

Moving beyond basic text extraction, you can leverage Python to parse a document's structure, manage formatting, and efficiently process multiple files in bulk.

Parsing document structure with python-docx

from docx import Document

doc = Document("example.docx")
tables = len(doc.tables)
sections = len(doc.sections)
paragraphs = len(doc.paragraphs)
print(f"Document structure: {paragraphs} paragraphs, {tables} tables, {sections} sections")

Output:
Document structure: 15 paragraphs, 2 tables, 1 sections

The python-docx library lets you see a document's underlying structure. After loading a file, the Document object provides access to its main components, allowing you to count the number of tables, sections, and paragraphs.

  • doc.tables gives you a list of all tables in the document.
  • doc.sections contains objects that define page layout properties like margins and orientation.
  • doc.paragraphs holds all the paragraph objects, which you can iterate through for text.

Using len() on these attributes is a quick way to get a high-level overview of the document's layout.
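Counting tables is only a first step. Text inside tables is not part of doc.paragraphs, so it has to be read from doc.tables separately. The sketch below shows one way to do that. The helper name table_rows is hypothetical, and the demo uses stand-in objects with the same attribute names (tables, rows, cells, text) so it runs without a real file.

```python
from types import SimpleNamespace


def table_rows(doc):
    """Yield each table row as a list of cell strings.

    Assumes doc looks like a python-docx Document:
    doc.tables -> table.rows -> row.cells -> cell.text
    """
    for table in doc.tables:
        for row in table.rows:
            yield [cell.text.strip() for cell in row.cells]


def _cell(text):
    # Stand-in for a table cell (demo data, not a real python-docx object)
    return SimpleNamespace(text=text)


fake_table = SimpleNamespace(rows=[
    SimpleNamespace(cells=[_cell("Name"), _cell("Role")]),
    SimpleNamespace(cells=[_cell("Ada"), _cell(" Engineer ")]),
])
fake_doc = SimpleNamespace(tables=[fake_table])

rows = list(table_rows(fake_doc))
print(rows)

# With python-docx installed, the same function accepts a real document:
#   from docx import Document
#   rows = list(table_rows(Document("example.docx")))
```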

Working with document styles and formatting

from docx import Document

doc = Document("example.docx")
styles = set()
for paragraph in doc.paragraphs:
    styles.add(paragraph.style.name)
print(f"Document uses {len(styles)} styles: {', '.join(styles)}")

Output:
Document uses 3 styles: Normal, Heading 1, Title

You can also analyze a document's formatting with python-docx. The code iterates through each paragraph and accesses its style.name attribute to identify the applied style. It’s a great way to gather a unique list of all styles present, since using a set automatically handles duplicates.

  • This lets you see what styles, like "Normal" or "Heading 1", are used.
  • You can use this information to standardize document formats or extract text only from specific styles.
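To illustrate extracting text only from specific styles, here is a minimal sketch that builds a heading outline. It assumes the document uses the built-in "Heading 1", "Heading 2", and similar style names. The function name heading_outline is invented for this example, and the demo data is a stand-in for a real Document.

```python
from types import SimpleNamespace


def heading_outline(doc, prefix="Heading"):
    """Return (style name, text) pairs for paragraphs whose style starts with prefix."""
    outline = []
    for para in doc.paragraphs:
        style_name = para.style.name if para.style is not None else ""
        text = para.text.strip()
        # Skip empty headings so the outline only lists real entries
        if style_name.startswith(prefix) and text:
            outline.append((style_name, text))
    return outline


def _para(style_name, text):
    # Stand-in for a paragraph object (demo data only)
    return SimpleNamespace(style=SimpleNamespace(name=style_name), text=text)


fake_doc = SimpleNamespace(paragraphs=[
    _para("Title", "Annual Review"),
    _para("Heading 1", "Summary"),
    _para("Normal", "Body text."),
    _para("Heading 2", "Details"),
    _para("Heading 1", "  "),
])

outline = heading_outline(fake_doc)
print(outline)

# With python-docx installed:
#   from docx import Document
#   outline = heading_outline(Document("example.docx"))
```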

Batch processing multiple DOCX files

import os
from docx import Document

docx_files = [f for f in os.listdir('.') if f.endswith('.docx')]
total_paragraphs = 0
for file in docx_files:
    doc = Document(file)
    total_paragraphs += len(doc.paragraphs)
print(f"Processed {len(docx_files)} files with {total_paragraphs} paragraphs in total")

Output:
Processed 3 files with 42 paragraphs in total

You can efficiently process multiple DOCX files by combining Python's os module with python-docx. The code first uses a list comprehension with os.listdir('.') to find all files in the current directory ending with the .docx extension.

  • It then loops through this list of filenames.
  • Inside the loop, each file is opened, and the number of its paragraphs is added to a running total.

It’s a powerful way to automate tasks like data aggregation or analysis across a large collection of documents.
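One practical wrinkle with folder scans: while a document is open in Word, a hidden lock file whose name starts with "~$" usually sits next to it, and it also ends in .docx. The sketch below is one possible way to build the file list with pathlib while skipping those entries. The function name find_docx_files is an assumption made for this example, and the demo creates empty placeholder files in a temporary folder.

```python
import tempfile
from pathlib import Path


def find_docx_files(folder):
    """Return sorted .docx paths in folder, skipping Word's '~$' lock files."""
    return sorted(
        path
        for path in Path(folder).glob("*.docx")
        if path.is_file() and not path.name.startswith("~$")
    )


# Demo: empty placeholder files stand in for real documents
with tempfile.TemporaryDirectory() as tmp:
    for name in ("report.docx", "~$report.docx", "notes.txt", "summary.docx"):
        (Path(tmp) / name).touch()
    found = [path.name for path in find_docx_files(tmp)]

print(found)
```

Each path returned by find_docx_files('.') can then be passed to Document() in the same loop as before.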

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. It's designed to help you build software directly from your ideas, handling the complex parts of development so you can focus on the concept.

With the techniques you've learned for parsing DOCX files, you can use Replit Agent to turn them into production-ready tools. Describe what you want to build, and the agent creates it—complete with databases, APIs, and deployment.

  • Build a resume screening tool that batch processes DOCX resumes, extracts key skills from the text, and flags candidates.
  • Create a content migration utility that reads DOCX files, extracts text and images using methods like those in docx2python, and converts them into a clean HTML or Markdown format.
  • Deploy a document compliance checker that analyzes DOCX files for corporate branding by verifying paragraph styles like 'Heading 1' and 'Normal' are used correctly.

You can bring your own document processing ideas to life. Describe your app to Replit Agent, and it will write the code, test it, and handle deployment automatically.

Common errors and challenges

When working with DOCX files in Python, you might run into a few common errors, but they're typically simple to solve.

A missing file is one of the most frequent issues. It simply means Python can't locate the document you're trying to open. python-docx reports this as a PackageNotFoundError, while libraries built directly on zipfile raise the built-in FileNotFoundError. To fix this, double-check that the filename and path are spelled correctly. Ensure the DOCX file is in the directory your script runs from, or provide the full, absolute path to its location.

You may also find your extracted text contains empty strings. This usually happens when the original Word document has blank lines, which libraries read as paragraphs without any text. To get cleaner output, you can filter these out during extraction by adding a simple check to see if a paragraph's text is empty before processing it.

Permission errors can occur when you try to save or modify a document. This often happens if the file is already open in another application, like Microsoft Word, or if you don't have the necessary write permissions for the folder. The solution is usually to close the file in any other programs or verify that your user account has permission to write to the target directory.

Handling missing files when opening documents

This error is straightforward: Python can't find the file you're trying to open. It's often caused by a simple typo in the filename or because the document isn't in the expected directory. Note that python-docx signals it with its own PackageNotFoundError (defined in docx.opc.exceptions) rather than the built-in FileNotFoundError. The following code triggers the error intentionally.

from docx import Document
doc = Document("nonexistent_file.docx") # This file doesn't exist
print(f"Document has {len(doc.paragraphs)} paragraphs")

The code calls Document("nonexistent_file.docx"), but since that file doesn't exist, python-docx raises a PackageNotFoundError and the script stops. The following example shows how you can handle this error to prevent the program from crashing.

from docx import Document
import os

filename = "nonexistent_file.docx"
try:
    if os.path.exists(filename):
        doc = Document(filename)
        print(f"Document has {len(doc.paragraphs)} paragraphs")
    else:
        print(f"File '{filename}' not found")
except Exception as e:
    print(f"Error opening document: {e}")

To avoid the error, you can proactively check if the file exists before trying to open it. This solution uses os.path.exists() to confirm the file is present before the document is processed, while the surrounding try...except catches anything else that goes wrong during opening, such as a corrupt file.

  • If the function returns True, the script proceeds to open the document.
  • If not, it prints a user-friendly message instead of crashing.

This is a good practice whenever your script relies on files that might be missing or moved.
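An existence check can be extended a little further. Since a DOCX file is a ZIP archive containing word/document.xml, the standard library alone can tell a missing file apart from one that merely has the right extension. The sketch below is one way to do this. The function name check_docx_path and its three return values are assumptions made for this example.

```python
import tempfile
import zipfile
from pathlib import Path


def check_docx_path(path):
    """Return 'ok', 'missing', or 'not a docx archive' for the given path."""
    path = Path(path)
    if not path.is_file():
        return "missing"
    if not zipfile.is_zipfile(path):
        return "not a docx archive"
    with zipfile.ZipFile(path) as archive:
        # The main document part must be present in a DOCX package
        if "word/document.xml" not in archive.namelist():
            return "not a docx archive"
    return "ok"


# Demo with throwaway files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)

    plain_text = folder / "renamed.docx"
    plain_text.write_text("just text with a .docx extension")

    minimal = folder / "minimal.docx"
    with zipfile.ZipFile(minimal, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")

    results = {
        "absent": check_docx_path(folder / "nonexistent_file.docx"),
        "renamed": check_docx_path(plain_text),
        "minimal": check_docx_path(minimal),
    }

print(results)
```

This only verifies the container, not that the XML inside is valid, so a try...except around the actual load is still worthwhile.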

Handling empty paragraphs when extracting text

You might notice your extracted text contains empty strings. This is a common issue caused by blank lines in the original document, which are read as paragraphs without text. Appending each para.text can clutter your data. The code below shows this in action.

from docx import Document

doc = Document("example.docx")
full_text = []
for para in doc.paragraphs:
    full_text.append(para.text)  # This might add empty strings
text = ' '.join(full_text)  # Can lead to extra spaces
print(f"Extracted text length: {len(text)}")

The loop appends every para.text to the list, including empty strings from blank lines in the document. Joining these later can introduce unwanted whitespace, making the extracted text messy. The following example shows how to fix this.

from docx import Document

doc = Document("example.docx")
full_text = []
for para in doc.paragraphs:
    if para.text.strip():  # Only add non-empty paragraphs
        full_text.append(para.text)
text = ' '.join(full_text)
print(f"Extracted text length: {len(text)}")

To get cleaner text, you can filter out empty paragraphs during extraction, which is useful when documents have blank lines for formatting. The fix is to add a conditional check inside the loop.

  • The condition if para.text.strip(): checks if a paragraph has content.
  • The .strip() method removes all leading and trailing whitespace.
  • If a paragraph is empty or just contains spaces, it's skipped.

This simple step prevents unwanted empty strings from cluttering your data.
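The same filter can be written as a small reusable helper. This is only a sketch: the name join_paragraphs is made up here, and it works on plain strings so it can be tried without a document.

```python
def join_paragraphs(paragraph_texts, separator="\n"):
    """Join paragraph strings, dropping blank ones and trimming the rest."""
    return separator.join(
        text.strip() for text in paragraph_texts if text.strip()
    )


# Sample strings standing in for the .text of each paragraph
sample = ["Title", "", "   ", "First paragraph.", "", "Second paragraph."]
cleaned = join_paragraphs(sample)
print(cleaned)

# With a python-docx document loaded as doc:
#   cleaned = join_paragraphs(para.text for para in doc.paragraphs)
```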

Troubleshooting permissions errors when saving documents

A permission error can stop you from saving changes to a DOCX file. This often happens if the file is already open in another program, like Microsoft Word, or if you don't have write access to the folder. The code below demonstrates how this error can occur when you try to save a document with doc.save() while it's potentially locked.

from docx import Document
doc = Document("example.docx")
doc.add_paragraph("New content added")
doc.save("example.docx") # Trying to save to the same file (may be locked)
print("Document saved successfully")

The code attempts to save changes back to the original file with doc.save("example.docx"). This will fail with a permission error if the file is locked. The next example shows how you can handle this situation gracefully.

from docx import Document

input_file = "example.docx"
output_file = "example_modified.docx"
try:
    doc = Document(input_file)
    doc.add_paragraph("New content added")
    doc.save(output_file)
    print(f"Document saved successfully as {output_file}")
except PermissionError:
    print(f"Cannot save file. Check if {output_file} is open in another program")

A good way to prevent permission errors is to save your changes to a new file instead of overwriting the original. This avoids conflicts if the source document is locked.

  • The code uses a try...except PermissionError block to catch potential issues when calling doc.save().
  • If an error occurs, it prints a helpful message instead of crashing.

This approach is useful whenever you're modifying files that might be open in another program.
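If you would rather overwrite the original when possible and only fall back to a new name when the file is locked, something like the sketch below could work. The helper save_with_fallback and the "_copy" naming scheme are assumptions made for this example, not part of python-docx. The demo uses a stand-in object whose save() fails for one filename, mimicking a locked file.

```python
from pathlib import Path


def save_with_fallback(doc, target):
    """Save doc to target; on PermissionError, save to a '<name>_copy' sibling.

    Returns the path that was actually written.
    """
    target = Path(target)
    try:
        doc.save(str(target))
        return target
    except PermissionError:
        fallback = target.with_name(f"{target.stem}_copy{target.suffix}")
        doc.save(str(fallback))
        return fallback


class LockedFileDoc:
    """Stand-in for a Document whose original file is locked (demo only)."""

    def __init__(self, locked_name):
        self.locked_name = locked_name
        self.saved_to = None

    def save(self, path):
        if Path(path).name == self.locked_name:
            raise PermissionError(f"{path} is in use")
        self.saved_to = path


demo_doc = LockedFileDoc("example.docx")
written = save_with_fallback(demo_doc, "example.docx")
print(written)
```

If the fallback path is also unwritable, the second PermissionError is left to propagate so the caller can decide what to do.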

Real-world applications

Putting these techniques into practice, you can automate complex document workflows, from extracting keywords to generating custom reports.

Extracting keywords from documents with Counter

Once you have the text, you can use Python’s Counter class to perform a quick frequency analysis and find the document's main keywords.

from docx import Document
import re
from collections import Counter

doc = Document("example.docx")
text = ' '.join([para.text for para in doc.paragraphs])
words = re.findall(r'\b[a-zA-Z]{3,15}\b', text.lower())
most_common = Counter(words).most_common(5)
print(f"Top 5 keywords: {most_common}")

This code combines several tools to process the document's text. After joining all paragraphs into a single string, it uses a regular expression to build a list of words.

  • The pattern r'\b[a-zA-Z]{3,15}\b' specifically looks for words that are 3 to 15 letters long, which helps filter out noise.
  • The collections.Counter object then tallies the frequency of each word in the list.
  • Finally, .most_common(5) retrieves the five words that appear most often, giving you a quick summary of the text's vocabulary.
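In practice, common words such as "the" and "and" tend to dominate a raw count. A possible refinement is to drop them before counting, as in this sketch. The stop-word list is a tiny illustrative sample, and the function name top_keywords is invented for this example.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer
STOP_WORDS = {"the", "and", "for", "that", "with", "this", "are", "was"}


def top_keywords(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent 3-15 letter words that are not stop words."""
    words = re.findall(r"\b[a-zA-Z]{3,15}\b", text.lower())
    counts = Counter(word for word in words if word not in stop_words)
    return counts.most_common(n)


sample_text = (
    "The report covers revenue and the revenue forecast. "
    "Revenue growth was strong, and the forecast is positive."
)
keywords = top_keywords(sample_text, n=2)
print(keywords)
```

With the sample sentence, "revenue" (3 occurrences) and "forecast" (2) come out on top, while "the" (also 3) is filtered out.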

Generating automated reports with python-docx

You can also use python-docx to build documents from the ground up, making it a powerful tool for automatically generating formatted reports from structured data.

from docx import Document
import pandas as pd

data = pd.DataFrame({'Quarter': ['Q1', 'Q2', 'Q3', 'Q4'],
                     'Revenue': [125000, 156000, 142000, 189000]})
doc = Document()
doc.add_heading('Quarterly Financial Report', 0)
table = doc.add_table(rows=1, cols=2)
table.style = 'Table Grid'
table.rows[0].cells[0].text = 'Quarter'
table.rows[0].cells[1].text = 'Revenue ($)'
for _, row in data.iterrows():
    cells = table.add_row().cells
    cells[0].text = row['Quarter']
    cells[1].text = f"${row['Revenue']:,}"
doc.save('financial_report.docx')
print("Financial report generated successfully with quarterly revenue data")

This script automates report creation by converting a pandas DataFrame into a formatted Word document. It initializes a blank document using Document() and adds a main title with doc.add_heading().

  • A table is created, styled, and populated with headers.
  • The code then iterates through the DataFrame, adding a new row to the table for each entry and filling it with the corresponding data.

Finally, doc.save() writes the in-memory document object to a .docx file on your disk.

Get started with Replit

Turn your new skills into a real tool. Tell Replit Agent: “Build a web app to convert DOCX files to HTML” or “Create a script that extracts tables from DOCX reports and saves them to a CSV.”

The agent writes the code, tests for errors, and deploys your app. Start building with Replit to bring your document processing ideas to life.

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
