How to load HTML
The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser.
This covers how to load HTML
documents into a LangChain Document objects that we can use downstream.
Parsing HTML files often requires specialized tools. Here we demonstrate parsing via Unstructured and BeautifulSoup4, which can be installed via pip. Head over to the integrations page to find integrations with additional services, such as Azure AI Document Intelligence or FireCrawl.
Loading HTML with Unstructuredβ
%pip install "unstructured[html]"
from langchain_community.document_loaders import UnstructuredHTMLLoader
file_path = "../../../docs/integrations/document_loaders/example_data/fake-content.html"
loader = UnstructuredHTMLLoader(file_path)
data = loader.load()
print(data)
[Document(page_content='My First Heading\n\nMy first paragraph.', metadata={'source': '../../../docs/integrations/document_loaders/example_data/fake-content.html'})]
Loading HTML with BeautifulSoup4β
We can also use BeautifulSoup4
to load HTML documents using the BSHTMLLoader
. This will extract the text from the HTML into page_content
, and the page title as title
into metadata
.
%pip install bs4
from langchain_community.document_loaders import BSHTMLLoader
loader = BSHTMLLoader(file_path)
data = loader.load()
print(data)
[Document(page_content='\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': '../../../docs/integrations/document_loaders/example_data/fake-content.html', 'title': 'Test Title'})]