
LangChain - 2.2 Document loaders

Reference: https://python.langchain.com/docs/modules/data_connection/document_loaders/csv

 


 

 

Document loaders

Document loaders load data from a source and return it as Document objects.

 

 

Simplest loader

from langchain_community.document_loaders import TextLoader

# Read the whole file into a single Document (text plus metadata)
loader = TextLoader("./index.md")
result = loader.load()
print(result)

[
    Document(page_content='---\nsidebar_position: 0\n---\n# Document loaders\n\nUse document loaders to load data from a source as `Document`\'s. A `Document` is a piece of text\nand associated metadata. For example, there are document loaders for loading a simple `.txt` file, for loading the text\ncontents of any web page, or even for loading a transcript of a YouTube video.\n\nEvery document loader exposes two methods:\n1. "Load": load documents from the configured source\n2. "Load and split": load documents from the configured source and split them using the passed in text splitter\n\nThey optionally implement:\n\n3. "Lazy load": load documents into memory lazily\n', metadata={'source': '../docs/docs/modules/data_connection/document_loaders/index.md'})
]
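
The output above mentions "Load" and "Lazy load". As a rough illustration of how those two relate, here is a minimal sketch using only the standard library. SimpleDocument and SimpleTextLoader are hypothetical stand-ins made up for this example, not LangChain classes, and the real TextLoader may differ in its details.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader that mirrors the load / lazy_load split."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def lazy_load(self) -> Iterator[SimpleDocument]:
        # Yield documents one at a time; a text file produces exactly one.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        yield SimpleDocument(page_content=text, metadata={"source": self.file_path})

    def load(self) -> list:
        # Eager variant: collect everything lazy_load yields into a list.
        return list(self.lazy_load())


# Use a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "index.md")
    Path(sample_path).write_text("# Document loaders\n", encoding="utf-8")
    docs = SimpleTextLoader(sample_path).load()

print(docs[0].page_content)
print(docs[0].metadata)
```

The point of the split is that lazy_load can be consumed one document at a time, while load simply gathers the same documents into a list.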

 

 

HTML

Reference: https://python.langchain.com/docs/modules/data_connection/document_loaders/html

 


 

Unstructured HTML Loader

# Requires the unstructured package
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
print(data)

 

 

BeautifulSoup4

# Requires the beautifulsoup4 package
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader("sample/file.html")
data = loader.load()
print(data)
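
BSHTMLLoader extracts the page text into page_content and the page title into metadata. To make that idea concrete without installing anything, the sketch below does something similar with the standard library's html.parser. TextAndTitleParser and html_to_document are hypothetical helpers written for this example, the sample HTML is made up, and the result is only an approximation of what BeautifulSoup would produce.

```python
from html.parser import HTMLParser


class TextAndTitleParser(HTMLParser):
    """Hypothetical helper: collects visible text and the <title> contents."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore CSS and JavaScript source
        if self._in_title:
            self.title += data
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_document(html: str, source: str) -> dict:
    # Build a dict shaped like a Document: page_content plus metadata.
    parser = TextAndTitleParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.chunks),
        "metadata": {"source": source, "title": parser.title.strip()},
    }


# Made-up sample page for illustration only.
sample_html = """<html><head><title>Test Title</title>
<style>p { color: red; }</style></head>
<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"""

doc = html_to_document(sample_html, source="sample/file.html")
print(doc)
```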

 

 

 

JSON

Reference: https://python.langchain.com/docs/modules/data_connection/document_loaders/json

 


 

 

JSONLoader

import json
from pathlib import Path
from pprint import pprint

# Inspect the raw JSON first to see which keys a jq_schema can target
file_path = 'sample/json_sample.json'
data = json.loads(Path(file_path).read_text())
pprint(data)

 

Using jq_schema

# Requires the jq package
from langchain_community.document_loaders import JSONLoader
from pprint import pprint

loader = JSONLoader(
    file_path='sample/json_sample.json',
    # jq expression: take the "content" field of every item in "message"
    jq_schema='.message[].content',
    # Accept extracted values that are not plain strings
    text_content=False
)

data = loader.load()
pprint(data)
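
To see what the jq expression `.message[].content` selects, it can help to write the same selection in plain Python. The sketch below assumes a made-up JSON payload, since the actual contents of sample/json_sample.json are not shown here. The seq_num field follows my understanding of the metadata JSONLoader attaches to each record, so treat it as an assumption rather than a guarantee.

```python
import json

# Hypothetical sample; the real sample/json_sample.json is not shown in the post.
raw = '''
{
  "title": "example chat",
  "message": [
    {"sender": "User 1", "content": "Hello"},
    {"sender": "User 2", "content": "Hi there"}
  ]
}
'''

data = json.loads(raw)

# Plain-Python equivalent of the jq expression `.message[].content`:
# iterate over the "message" array and pull out each "content" field.
contents = [item["content"] for item in data["message"]]

# One record per selected value, numbered from 1 (assumed seq_num convention).
records = [
    {"page_content": text, "metadata": {"seq_num": i}}
    for i, text in enumerate(contents, start=1)
]

for record in records:
    print(record)
```

Each selected value becomes its own record, so a file with many messages yields many small documents rather than one large one.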

 
