Reference: https://python.langchain.com/docs/modules/model_io/llms/llm_caching
Caching
Caching is used to reduce the number of API calls for repeated identical prompts.
Method | Code | Etc. |
(Prepare) | llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2) | |
InMemoryCache | set_llm_cache(InMemoryCache()) | |
SQLiteCache | set_llm_cache(SQLiteCache(database_path=".langchain.db")) | The database file is created automatically if it does not exist |
(Check) | Call llm.invoke("Tell me a joke") twice and time each call (full code below) | The first call is not yet cached and is slow; the second is served from the cache |
InMemoryCache
import datetime
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI
# To make the caching really obvious, let's use a slower model.
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2)
from langchain.cache import InMemoryCache
set_llm_cache(InMemoryCache())
# The first time, it is not yet in cache, so it should take longer
varNow = datetime.datetime.now()
result = llm.invoke("Tell me a joke")
print('Output>', result, datetime.datetime.now() - varNow)
# The second time it is already in the cache, so it goes faster
varNow = datetime.datetime.now()
result = llm.invoke("Tell me a joke")
print('Output>', result, datetime.datetime.now() - varNow)
Output>
Why don't scientists trust atoms?
Because they make up everything. 0:00:00.769805
Output>
Why don't scientists trust atoms?
Because they make up everything. 0:00:00.000299
- The first call runs before anything is cached; the second call reuses the cached result, so compare the two elapsed times (0.769805 s vs. 0.000299 s above).
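If you keep a reference to the cache object instead of constructing it inline, you can also empty it between runs. A minimal sketch, reusing the llm defined above; clear() is the cache's reset method, and the cache variable name is just for illustration:
from langchain.cache import InMemoryCache
from langchain.globals import set_llm_cache

cache = InMemoryCache()
set_llm_cache(cache)

llm.invoke("Tell me a joke")   # first call: goes to the API and fills the cache
llm.invoke("Tell me a joke")   # second call: answered from the cache
cache.clear()                  # drop all cached entries
llm.invoke("Tell me a joke")   # cache is empty again, so this hits the API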
SQLite Cache
import datetime
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI
# To make the caching really obvious, let's use a slower model.
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2)
# We can do the same thing with a SQLite cache
from langchain.cache import SQLiteCache
set_llm_cache(SQLiteCache(database_path=".langchain.db"))
# The first time, it is not yet in cache, so it should take longer
varNow = datetime.datetime.now()
result = llm.invoke("Tell me a joke")
print('Output>', result, datetime.datetime.now() - varNow)
# The second time it is already in the cache, so it goes faster
varNow = datetime.datetime.now()
result = llm.invoke("Tell me a joke")
print('Output>', result, datetime.datetime.now() - varNow)
Output>
Why couldn't the bicycle stand up by itself? Because it was two-tired. 0:00:00.917748
Output>
Why couldn't the bicycle stand up by itself? Because it was two-tired. 0:00:00.072242
- Only the cache backend changes to SQLite; the rest of the code is the same as the in-memory example.
- The cache hit is somewhat slower than with the in-memory cache (0.072242 s vs. 0.000299 s above), since the result is read back from the database file on disk.
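Unlike the in-memory cache, the SQLite cache persists across interpreter restarts because it lives in the .langchain.db file. A minimal sketch for checking and resetting that file; the db_path variable is just for illustration:
import os

from langchain.cache import SQLiteCache
from langchain.globals import set_llm_cache

db_path = ".langchain.db"          # same path as above
cache = SQLiteCache(database_path=db_path)
set_llm_cache(cache)

# Entries written in a previous run are still served from this file,
# so a repeated prompt avoids a new API call even after a restart.
print('Cache file exists?', os.path.exists(db_path))

# To start clean, clear the cache (or simply delete the file).
cache.clear()
# os.remove(db_path)               # alternative: remove the file itself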
Reference: https://python.langchain.com/docs/modules/model_io/llms/streaming_llm
Streaming
Streaming is supported by default.
from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)
for chunk in llm.stream("Write me a song about sparkling water."):
    print(chunk, end="", flush=True)
- The response is printed chunk by chunk as it streams in.
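Streaming also works asynchronously through the same Runnable interface. A minimal sketch using astream inside an asyncio event loop; the main coroutine name is just for illustration:
import asyncio

from langchain_openai import OpenAI

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0, max_tokens=512)

async def main():
    # astream yields chunks as they arrive, like stream but awaitable
    async for chunk in llm.astream("Write me a song about sparkling water."):
        print(chunk, end="", flush=True)

asyncio.run(main())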
Reference: https://python.langchain.com/docs/modules/model_io/llms/token_usage_tracking
Tracking token usage
Token usage of OpenAI API calls can be tracked with get_openai_callback.
from langchain_community.callbacks import get_openai_callback
from langchain_openai import OpenAI

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2)

with get_openai_callback() as cb:
    result = llm.invoke("Tell me a joke")
print('Output>', cb)

with get_openai_callback() as cb:
    result = llm.invoke("Tell me a joke")
    result2 = llm.invoke("Tell me a joke")
print('Output>', cb.total_tokens)
Output> Tokens Used: 37
Prompt Tokens: 4
Completion Tokens: 33
Successful Requests: 1
Total Cost (USD): $7.2e-05
Output> 78
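The callback object also exposes individual counters (prompt_tokens, completion_tokens, successful_requests, total_cost), so the same pattern can summarize a whole batch of prompts. A minimal sketch; the prompts list is just an example:
from langchain_community.callbacks import get_openai_callback
from langchain_openai import OpenAI

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", n=2, best_of=2)

prompts = ["Tell me a joke", "Tell me a riddle"]
with get_openai_callback() as cb:
    for p in prompts:
        llm.invoke(p)

# The handler accumulates usage over every call made inside the block.
print('Requests  :', cb.successful_requests)
print('Prompt    :', cb.prompt_tokens)
print('Completion:', cb.completion_tokens)
print('Total     :', cb.total_tokens)
print('Cost (USD):', cb.total_cost)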