Reference: https://python.langchain.com/docs/modules/data_connection/document_transformers/
Text Splitter

How it works:
- Split the text into very small chunks (e.g. sentences).
- Merge these small pieces into larger chunks until a target size is reached.
- Once that size is reached, emit the chunk as its own piece of text, then start a new chunk that overlaps slightly with the previous one.

Two axes of customization:
- How the text is split
- How the chunk size is measured
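The merge step described above can be sketched in plain Python. This is an illustrative model, not LangChain's actual implementation; it counts only the characters of the pieces themselves and ignores separator lengths:

```python
def merge_splits(pieces, chunk_size, chunk_overlap):
    """Greedily pack small pieces into chunks of roughly chunk_size chars."""
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        # Close the current chunk when adding this piece would exceed the budget.
        if current and current_len + len(piece) > chunk_size:
            chunks.append(" ".join(current))
            # Drop leading pieces until only ~chunk_overlap chars remain;
            # the surviving trailing pieces seed the next chunk (the overlap).
            while current and current_len > chunk_overlap:
                current_len -= len(current.pop(0))
        current.append(piece)
        current_len += len(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks

print(merge_splits(["aaaa", "bbbb", "cccc", "dddd"], 10, 4))
# ['aaaa bbbb', 'bbbb cccc', 'cccc dddd']
```

Note how each chunk repeats the tail of the previous one, which is exactly the "slightly overlapping" behavior the bullets describe.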
Types of Text Splitter

Name | Splits On | Adds Metadata | Description
---|---|---|---
Recursive | A list of user-defined characters | | Splits text recursively, aiming to keep related pieces of text next to each other. The recommended starting point.
HTML | HTML-specific characters | v | Splits on HTML-specific tags and adds metadata about where each chunk came from.
Markdown | Markdown-specific characters | v | Splits on Markdown-specific characters and adds metadata about where each chunk came from.
Code | Code (language)-specific characters | | Splits on characters specific to each programming language; 15 languages to choose from.
Token | Tokens | | Splits text by tokens; several ways to measure tokens exist.
Character | A user-defined character (single character) | | Splits on a user-defined character. One of the simplest methods.
Semantic Chunker | Sentences | | First splits into sentences, then merges adjacent sentences when they are semantically similar enough.
AI21 Semantic Text Splitter | Semantics | v | Identifies distinct topics that form coherent pieces of text and splits along them.
Recursively split by character

- Splits on a list of characters
- Measures chunk size by number of characters
from langchain_text_splitters import RecursiveCharacterTextSplitter
with open("sample/TechRepo-ALL-html_-TXT_0.txt") as f:
state_of_the_union = f.read()
text_splitter = RecursiveCharacterTextSplitter(
# Set a really small chunk size, just to show.
chunk_size=100,
chunk_overlap=20,
length_function=len,
is_separator_regex=False,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])
result = text_splitter.split_text(state_of_the_union)[:2]
print(result)
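The "recursive" part works roughly like this: try the coarsest separator first, and fall back to finer separators only for pieces that are still too large. A simplified model of the idea (not the library's code; the real splitter also merges the resulting pieces back up toward chunk_size and adds overlap):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text on the first separator; recurse into oversized pieces
    with the remaining, finer-grained separators."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = list(text) if sep == "" else text.split(sep)
    out = []
    for piece in pieces:
        if len(piece) > chunk_size:
            out.extend(recursive_split(piece, chunk_size, rest))
        elif piece:
            out.append(piece)
    return out

sample = "para one.\n\npara two is a bit longer here.\n\nshort"
print(recursive_split(sample, 20))
# ['para one.', 'para', 'two', 'is', 'a', 'bit', 'longer', 'here.', 'short']
```

Because paragraph breaks are tried before spaces, paragraphs that already fit stay intact, which is why this splitter tends to keep related text together.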
HTMLHeaderTextSplitter
Reference: https://python.langchain.com/docs/modules/data_connection/document_transformers/HTML_header_metadata

Splits on units that reveal the underlying document structure (Markdown, HTML, etc. = "structure-aware" chunking)
from langchain_text_splitters import HTMLHeaderTextSplitter
html_string = """
<!DOCTYPE html>
<html>
<body>
<div>
<h1>Foo</h1>
<p>Some intro text about Foo.</p>
<div>
<h2>Bar main section</h2>
<p>Some intro text about Bar.</p>
<h3>Bar subsection 1</h3>
<p>Some text about the first subtopic of Bar.</p>
<h3>Bar subsection 2</h3>
<p>Some text about the second subtopic of Bar.</p>
</div>
<div>
<h2>Baz</h2>
<p>Some text about Baz</p>
</div>
<br>
<p>Some concluding text about Foo</p>
</div>
</body>
</html>
"""
headers_to_split_on = [
("h1", "Header 1"),
("h2", "Header 2"),
("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
print(html_header_splits)
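To see what "adds metadata about where each chunk came from" means, here is a stripped-down sketch of header-aware splitting using only the standard library's html.parser. The class name and structure are mine for illustration; the real HTMLHeaderTextSplitter is far more robust:

```python
from html.parser import HTMLParser

class HeaderSplitter(HTMLParser):
    """Collect (header-metadata, text) chunks from an HTML document."""

    def __init__(self, headers):
        super().__init__()
        self.headers = headers      # e.g. {"h1": "Header 1", "h2": "Header 2"}
        self.context = {}           # header hierarchy currently in effect
        self.chunks = []            # list of (metadata, text) pairs
        self.in_header = None
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.headers:
            self.flush()            # close the chunk under the old headers
            self.in_header = tag

    def handle_data(self, data):
        self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # A new h2 invalidates any previous h3 context, and so on.
            level = int(tag[1])
            self.context = {k: v for k, v in self.context.items()
                            if int(k[1]) < level}
            self.context[tag] = "".join(self.buffer).strip()
            self.buffer, self.in_header = [], None

    def flush(self):
        text = "".join(self.buffer).strip()
        if text:
            meta = {self.headers[k]: v for k, v in self.context.items()}
            self.chunks.append((meta, text))
        self.buffer = []

s = HeaderSplitter({"h1": "Header 1", "h2": "Header 2"})
s.feed("<h1>Foo</h1><p>intro</p><h2>Bar</h2><p>body</p>")
s.flush()                           # capture any trailing text
print(s.chunks)
# [({'Header 1': 'Foo'}, 'intro'), ({'Header 1': 'Foo', 'Header 2': 'Bar'}, 'body')]
```

Each chunk carries the full chain of headers it appeared under, which is the metadata a downstream retriever can filter or display.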
Comparison
# Load the document, split it into chunks, embed each chunk and load it into the vector store.
from datetime import datetime
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

SourceFile = "sample/TechRepo-ALL-html_-TXT_0.txt"   # same sample file as above

print('Step-1. Load data begins -------------------------------------------------------')
varSetTime = datetime.now()
rawDocuments = TextLoader(SourceFile).load()
print('Step-1. Load data -', datetime.now() - varSetTime )
print('Step-2. Split data begins -----------------------------------------------------')
varSetTime = datetime.now()
def chooseTextSplitter(varMode, varParams):
if varMode == 'Char' : return CharacterTextSplitter (**varParams)
elif varMode == 'Recur': return RecursiveCharacterTextSplitter(**varParams)
elif varMode == 'Token': return CharacterTextSplitter.from_tiktoken_encoder(**varParams)
return None
textSplitter = chooseTextSplitter('Char', {'chunk_size': 100, 'chunk_overlap': 0,'length_function' : len})
documents = textSplitter.split_documents(rawDocuments)
print('------------------------------------------------------------------')
print('Splitter Spe-option Chunks 1st 2nd 3rd')
print('------------------------------------------------------------------')
print('- Character + No-Separator = {:6d} {:6d} {:6d} {:6d}'.format(
len(documents), len(documents[0].page_content), len(documents[1].page_content), len(documents[2].page_content)))
textSplitter = chooseTextSplitter('Char', {'chunk_size': 100, 'chunk_overlap': 0,'length_function' : len, 'separator': ''})
documents = textSplitter.split_documents(rawDocuments)
print('- Character + Separator = {:6d} {:6d} {:6d} {:6d}'.format(
len(documents), len(documents[0].page_content), len(documents[1].page_content), len(documents[2].page_content)))
textSplitter = chooseTextSplitter('Recur', {'chunk_size': 100, 'chunk_overlap': 0,'length_function' : len})
documents = textSplitter.split_documents(rawDocuments)
print('- Recursive = {:6d} {:6d} {:6d} {:6d}'.format(
len(documents), len(documents[0].page_content), len(documents[1].page_content), len(documents[2].page_content)))
textSplitter = chooseTextSplitter('Token', {'chunk_size': 100, 'chunk_overlap': 0,'encoding_name':'cl100k_base'})
documents = textSplitter.split_documents(rawDocuments)
print('- Token + /w cl100k_base = {:6d} {:6d} {:6d} {:6d}'.format(
len(documents), len(documents[0].page_content), len(documents[1].page_content), len(documents[2].page_content)))
print('------------------------------------------------------------------')
Step-1. Load data begins -------------------------------------------------------
Step-1. Load data - 0:00:00.000902
Step-2. Split data begins -----------------------------------------------------
Created a chunk of size 50821, which is longer than the specified 100
Created a chunk of size 553, which is longer than the specified 100
Created a chunk of size 1358, which is longer than the specified 100
Created a chunk of size 4008, which is longer than the specified 100
Created a chunk of size 5640, which is longer than the specified 100
Created a chunk of size 6437, which is longer than the specified 100
Created a chunk of size 560, which is longer than the specified 100
Created a chunk of size 15489, which is longer than the specified 100
Created a chunk of size 7105, which is longer than the specified 100
Created a chunk of size 17701, which is longer than the specified 100
Created a chunk of size 9352, which is longer than the specified 100
Created a chunk of size 5046, which is longer than the specified 100
Created a chunk of size 17139, which is longer than the specified 100
Created a chunk of size 17023, which is longer than the specified 100
Created a chunk of size 16245, which is longer than the specified 100
Created a chunk of size 470, which is longer than the specified 100
------------------------------------------------------------------
Splitter Spe-option Chunks 1st 2nd 3rd
------------------------------------------------------------------
- Character + No-Separator = 17 50821 553 1358
- Character + Separator = 1927 98 100 100
- Recursive + No-Separator = 2250 83 71 98
Created a chunk of size 18722, which is longer than the specified 100
Created a chunk of size 174, which is longer than the specified 100
Created a chunk of size 723, which is longer than the specified 100
Created a chunk of size 1529, which is longer than the specified 100
Created a chunk of size 2277, which is longer than the specified 100
Created a chunk of size 2301, which is longer than the specified 100
Created a chunk of size 174, which is longer than the specified 100
Created a chunk of size 6336, which is longer than the specified 100
Created a chunk of size 2510, which is longer than the specified 100
Created a chunk of size 6457, which is longer than the specified 100
Created a chunk of size 3635, which is longer than the specified 100
Created a chunk of size 1774, which is longer than the specified 100
Created a chunk of size 6909, which is longer than the specified 100
Created a chunk of size 6755, which is longer than the specified 100
Created a chunk of size 6694, which is longer than the specified 100
Created a chunk of size 144, which is longer than the specified 100
- Token + /w cl100k_base = 17 50821 553 1358
------------------------------------------------------------------
Step-2. Split data - 0:00:00.620670
- With CharacterTextSplitter:
  - If no separator option is given, chunks are not actually cut down to the requested size (the default separator is '\n\n', so anything between separators stays whole).
  - With separator='', text is cut to exactly the requested size.
- With RecursiveCharacterTextSplitter:
  - The separator option is not available (it uses its own separators list).
  - Cuts as close as possible to the requested size, much like CharacterTextSplitter with separator=''.
- With the token-based splitter:
  - The length_function option is not available (size is measured in tokens).
  - Produces the same result as CharacterTextSplitter with no separator option.
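The reason length_function does not apply to the token-based splitter is that chunk size is measured in tokens rather than characters, and the same budget means very different amounts of text. A toy illustration (whitespace words stand in for real tokens here; the actual splitter uses tiktoken encodings such as cl100k_base):

```python
# The same text measured two ways: a chunk_size of, say, 50 would fit this
# whole sentence under a token-like measure, but not under a character one.
text = "the quick brown fox jumps over the lazy dog"

char_length = len(text)            # character-based measure
word_length = len(text.split())    # crude stand-in for a token count

print(char_length, word_length)    # 43 9
```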