diff --git a/Makefile b/Makefile
index 3e4b4fb19..f00c9a6be 100644
--- a/Makefile
+++ b/Makefile
@@ -161,7 +161,7 @@ build-docs: ## Build all documentation
 	@echo "Converting ipynb notebooks to md files..."
 	$(Q)MKDOCS_CI=true uv run python $(GIT_ROOT)/docs/ipynb_to_md.py
 	@echo "Building ragas documentation..."
-	$(Q)uv run --group docs mkdocs build
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs build
 
 serve-docs: ## Build and serve documentation locally
-	$(Q)uv run --group docs mkdocs serve --dirtyreload
+	$(Q)MKDOCS_CI=false uv run --group docs mkdocs serve --dirtyreload
diff --git a/README.md b/README.md
index e13b5bdc4..592fbf473 100644
--- a/README.md
+++ b/README.md
@@ -97,21 +97,39 @@ Available templates:
 
 ### Evaluate your LLM App
 
-This is 5 main lines:
+Here is a simple example that evaluates a summary for accuracy:
 
 ```python
-from ragas import SingleTurnSample
-from ragas.metrics import AspectCritic
+import asyncio
+from ragas.metrics.collections import AspectCritic
+from ragas.llms import llm_factory
 
+# Set up your LLM
+llm = llm_factory("gpt-4o")
+
+# Create a metric
+metric = AspectCritic(
+    name="summary_accuracy",
+    definition="Verify if the summary is accurate and captures key information.",
+    llm=llm
+)
+
+# Evaluate
 test_data = {
     "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.",
     "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.",
 }
-evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
-metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.")
-await metric.single_turn_ascore(SingleTurnSample(**test_data))
+
+score = asyncio.run(metric.ascore(
+    user_input=test_data["user_input"],
+    response=test_data["response"]
+))
+print(f"Score: {score.value}")
+print(f"Reason: {score.reason}")
 ```
 
+> **Note**: Make sure your `OPENAI_API_KEY` environment variable is set. In a notebook or other async context, you can `await metric.ascore(...)` directly instead of using `asyncio.run`.
+
 Find the complete [Quickstart Guide](https://docs.ragas.io/en/latest/getstarted/evals)
 
 ## Want help in improving your AI application using evals?
diff --git a/docs/getstarted/evals.md b/docs/getstarted/evals.md
index 08de6b4cc..c4520aa39 100644
--- a/docs/getstarted/evals.md
+++ b/docs/getstarted/evals.md
@@ -2,183 +2,229 @@
 The purpose of this guide is to illustrate a simple workflow for testing and evaluating an LLM application with `ragas`. It assumes minimum knowledge in AI application building and evaluation. Please refer to our [installation instruction](./install.md) for installing `ragas`
 
+!!! tip "Get a Working Example"
+    The fastest way to see these concepts in action is to create a project using the quickstart command:
 
-## Evaluation
 
+    ```sh
+    ragas quickstart rag_eval
+    ```
 
-In this guide, you will evaluate a **text summarization pipeline**. The goal is to ensure that the output summary accurately captures all the key details specified in the text, such as growth figures, market insights, and other essential information.
 
+    This generates a complete project with sample code. Follow along with this guide to understand what's happening in your generated code.
-`ragas` offers a variety of methods for analyzing the performance of LLM applications, referred to as [metrics](../concepts/metrics/available_metrics/index.md). Each metric requires a predefined set of data points, which it uses to calculate scores that indicate performance. + ```sh + cd rag_eval + ``` -### Evaluating using a Non-LLM Metric + Let's get started -Here is a simple example that uses `BleuScore` to score a summary: +## Project Structure -```python -from ragas import SingleTurnSample -from ragas.metrics import BleuScore - -test_data = { - "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.", - "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.", - "reference": "The company reported an 8% growth in Q3 2024, primarily driven by strong sales in the Asian market, attributed to strategic marketing and localized products, with continued growth anticipated in the next quarter." -} -metric = BleuScore() -test_data = SingleTurnSample(**test_data) -metric.single_turn_score(test_data) -``` +Here's what gets created for you: -Output -``` -0.137 +```sh +rag_eval/ +├── README.md # Quick start guide for your project +├── evals.py # Your evaluation code (metrics + datasets) +├── rag.py # Your RAG/LLM application +└── evals/ # Evaluation artifacts + ├── datasets/ # Test data files (edit these to add more test cases) + ├── experiments/ # Results from running evaluations + └── logs/ # Evaluation execution logs ``` -Here we used: +**Key files to focus on:** -- A test sample containing `user_input`, `response` (the output from the LLM), and `reference` (the expected output from the LLM) as data points to evaluate the summary. -- A non-LLM metric called [BleuScore](../concepts/metrics/available_metrics/traditional.md#bleu-score) +- **`evals.py`** - Where you define metrics and load test data (we'll explore this next) +- **`rag.py`** - Your application code (query engine, retrieval, etc.) +- **`evals/datasets/`** - Add your test cases here as CSV or JSON files +## Understanding the Code -As you may observe, this approach has two key limitations: +In your generated project's `evals.py` file, you'll see two key patterns for evaluation: -- **Time-consuming preparation:** Evaluating the application requires preparing the expected output (`reference`) for each input, which can be both time-consuming and challenging. +1. **Metrics** - Functions that score your application's output +2. **Datasets** - Test data that your application is evaluated against -- **Inaccurate scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`. +`ragas` offers a variety of evaluation methods, referred to as [metrics](../concepts/metrics/available_metrics/index.md). Let's walk through the most common ones you'll encounter. +### Custom Evaluation with LLMs -!!! info - A **non-LLM metric** refers to a metric that does not rely on an LLM for evaluation. 
+In your generated project, you'll see the `DiscreteMetric` - a flexible metric that uses an LLM to evaluate based on any criteria you define: -To address these issues, let's try an LLM-based metric. +```python +from ragas.metrics import DiscreteMetric +from ragas.llms import llm_factory +# Create your evaluator LLM +evaluator_llm = llm_factory("gpt-4o") -### Evaluating using a LLM-based Metric +# Define a custom metric +my_metric = DiscreteMetric( + name="correctness", + prompt="Check if the response is correct. Return 'pass' or 'fail'.\nResponse: {response}\nExpected: {expected}", + allowed_values=["pass", "fail"], +) +# Use it to score +score = my_metric.score( + llm=evaluator_llm, + response="The capital of France is Paris", + expected="Paris" +) +print(f"Score: {score.value}") # Output: 'pass' +``` -**Choose your LLM** ---8<-- -choose_evaluator_llm.md ---8<-- +What you see in your generated `evals.py` lets you define evaluation logic that matters for your application. Learn more about [custom metrics](../concepts/metrics/index.md). -**Evaluation** +### Choosing Your Evaluator LLM -Here we will use [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md), which is an LLM based metric that outputs pass/fail given the evaluation criteria. +Your evaluation metrics need an LLM to score your application. Ragas works with **any LLM provider** through the `llm_factory`. Your quickstart project uses OpenAI by default, but you can easily swap to any provider by updating the LLM creation in your `evals.py`: -```python -from ragas import SingleTurnSample -from ragas.metrics import AspectCritic +=== "OpenAI" + Set your OpenAI API key: -test_data = { - "user_input": "summarise given text\nThe company reported an 8% rise in Q3 2024, driven by strong performance in the Asian market. Sales in this region have significantly contributed to the overall growth. Analysts attribute this success to strategic marketing and product localization. The positive trend in the Asian market is expected to continue into the next quarter.", - "response": "The company experienced an 8% increase in Q3 2024, largely due to effective marketing strategies and product adaptation, with expectations of continued growth in the coming quarter.", -} + ```sh + export OPENAI_API_KEY="your-openai-key" + ``` -metric = AspectCritic(name="summary_accuracy",llm=evaluator_llm, definition="Verify if the summary is accurate.") -test_data = SingleTurnSample(**test_data) -await metric.single_turn_ascore(test_data) + In your `evals.py`: -``` + ```python + from ragas.llms import llm_factory -Output -``` -1 -``` + llm = llm_factory("gpt-4o") + ``` -Success! Here 1 means pass and 0 means fail + The quickstart project already sets this up for you! -!!! info - There are many other types of metrics that are available in `ragas` (with and without `reference`), and you may also create your own metrics if none of those fits your case. To explore this more checkout [more on metrics](../concepts/metrics/index.md). +=== "Anthropic Claude" + Set your Anthropic API key: -### Evaluating on a Dataset + ```sh + export ANTHROPIC_API_KEY="your-anthropic-key" + ``` -In the examples above, we used only a single sample to evaluate our application. However, evaluating on just one sample is not robust enough to trust the results. To ensure the evaluation is reliable, you should add more test samples to your test data. 
+ In your `evals.py`: -Here, we’ll load a dataset from Hugging Face Hub, but you can load data from any source, such as production logs or other datasets. Just ensure that each sample includes all the required attributes for the chosen metric. + ```python + from ragas.llms import llm_factory -In our case, the required attributes are: -- **`user_input`**: The input provided to the application (here the input text report). -- **`response`**: The output generated by the application (here the generated summary). + llm = llm_factory("claude-3-5-sonnet-20241022", provider="anthropic") + ``` -For example +=== "Google Cloud" + Set up your Google credentials: -```python -[ - # Sample 1 - { - "user_input": "summarise given text\nThe Q2 earnings report revealed a significant 15% increase in revenue, ...", - "response": "The Q2 earnings report showed a 15% revenue increase, ...", - }, - # Additional samples in the dataset - ...., - # Sample N - { - "user_input": "summarise given text\nIn 2023, North American sales experienced a 5% decline, ...", - "response": "Companies are strategizing to adapt to market challenges and ...", - } -] -``` + ```sh + export GOOGLE_API_KEY="your-google-api-key" + ``` -```python -from datasets import load_dataset -from ragas import EvaluationDataset -eval_dataset = load_dataset("explodinggradients/earning_report_summary",split="train") -eval_dataset = EvaluationDataset.from_hf_dataset(eval_dataset) -print("Features in dataset:", eval_dataset.features()) -print("Total samples in dataset:", len(eval_dataset)) -``` + In your `evals.py`: -Output -``` -Features in dataset: ['user_input', 'response'] -Total samples in dataset: 50 -``` - -Evaluate using dataset + ```python + from ragas.llms import llm_factory -```python -from ragas import evaluate + llm = llm_factory("gemini-1.5-pro", provider="google") + ``` -results = evaluate(eval_dataset, metrics=[metric]) -results -``` +=== "Local Models (Ollama)" + Install and run Ollama locally, then in your `evals.py`: -!!! tip "Async Usage" - For production async applications, use `aevaluate()` to avoid event loop conflicts: ```python - from ragas import aevaluate + from ragas.llms import llm_factory - # In an async function - results = await aevaluate(eval_dataset, metrics=[metric]) + llm = llm_factory( + "mistral", + provider="ollama", + base_url="http://localhost:11434" # Default Ollama URL + ) ``` - Or disable nest_asyncio in sync code: +=== "Custom / Other Providers" + For any LLM with OpenAI-compatible API: + ```python - results = evaluate(eval_dataset, metrics=[metric], allow_nest_asyncio=False) + from ragas.llms import llm_factory + + llm = llm_factory( + "model-name", + api_key="your-api-key", + base_url="https://your-api-endpoint" + ) ``` -Output -``` -{'summary_accuracy': 0.84} -``` + For more details, learn about [LLM integrations](../concepts/metrics/index.md). -This score shows that out of all the samples in our test data, only 84% of summaries passes the given evaluation criteria. Now, **It's -important to see why is this the case**. +### Using Pre-Built Metrics -Export the sample level scores to pandas dataframe +`ragas` comes with pre-built metrics for common evaluation tasks. 
For example, [AspectCritic](../concepts/metrics/available_metrics/aspect_critic.md) evaluates any aspect of your output:
 ```python
-results.to_pandas()
-```
+from ragas.metrics.collections import AspectCritic
+from ragas.llms import llm_factory
+
+# Set up your evaluator LLM
+evaluator_llm = llm_factory("gpt-4o")
+
+# Use a pre-built metric
+metric = AspectCritic(
+    name="summary_accuracy",
+    definition="Verify if the summary is accurate and captures key information.",
+    llm=evaluator_llm
+)
-Output
+# Score your application's output
+score = await metric.ascore(
+    user_input="Summarize this text: ...",
+    response="The summary of the text is..."
+)
+print(f"Score: {score.value}") # 1 = pass, 0 = fail
+print(f"Reason: {score.reason}")
 ```
-    user_input    response    summary_accuracy
-0 summarise given text\nThe Q2 earnings report r... The Q2 earnings report showed a 15% revenue in... 1
-1 summarise given text\nIn 2023, North American ... Companies are strategizing to adapt to market ... 1
-2 summarise given text\nIn 2022, European expans... Many companies experienced a notable 15% growt... 1
-3 summarise given text\nSupply chain challenges ... Supply chain challenges in North America, caus... 1
+
+Pre-built metrics like this save you from defining evaluation logic from scratch. Explore [all available metrics](../concepts/metrics/available_metrics/index.md).
+
+!!! info
+    There are many other types of metrics available in `ragas` (with and without `reference`), and you can also create your own metrics if none of them fits your use case. To explore this further, check out [more on metrics](../concepts/metrics/index.md).
+
+### Evaluating on a Dataset
+
+In your quickstart project, you'll see a `load_dataset()` function, which creates test data with multiple samples:
+
+```python
+from ragas import Dataset
+
+# Create a dataset with multiple test samples
+dataset = Dataset(
+    name="test_dataset",
+    backend="local/csv",  # Can also use JSONL, Google Drive, or in-memory
+    root_dir=".",
+)
+
+# Add samples to the dataset
+data_samples = [
+    {
+        "user_input": "What is ragas?",
+        "response": "Ragas is an evaluation framework...",
+        "expected": "Ragas provides objective metrics..."
+    },
+    {
+        "user_input": "How do metrics work?",
+        "response": "Metrics score your application...",
+        "expected": "Metrics evaluate performance..."
+    },
+]
+
+for sample in data_samples:
+    dataset.append(sample)
+
+# Save to disk
+dataset.save()
 ```
-Viewing the sample-level results in a CSV file, as shown above, is fine for quick checks but not ideal for detailed analysis or comparing results across evaluation runs.
+This gives you multiple test cases instead of evaluating one example at a time. Learn more about [datasets and experiments](../concepts/components/eval_dataset.md).
+
+Your generated project includes sample data in the `evals/datasets/` folder - you can edit those files to add more test cases.
 
 ### Want help in improving your AI application using evals?
diff --git a/docs/getstarted/index.md b/docs/getstarted/index.md
index 26ae9923a..15016bb7e 100644
--- a/docs/getstarted/index.md
+++ b/docs/getstarted/index.md
@@ -11,6 +11,7 @@ If you have any questions about Ragas, feel free to join and ask in the `#questi
 
 Let's get started!
+ - [Evaluate your first AI app](./evals.md) - [Run ragas metrics for evaluating RAG](rag_eval.md) - [Generate test data for evaluating RAG](rag_testset_generation.md) diff --git a/docs/getstarted/quickstart.md b/docs/getstarted/quickstart.md new file mode 100644 index 000000000..0f6fe9fb7 --- /dev/null +++ b/docs/getstarted/quickstart.md @@ -0,0 +1,143 @@ +# Quick Start: Get Evaluations Running in a flash + +Get started with Ragas in seconds. No installation needed! Just set your API key and run one command. + +## 1. Set Your API Key + +Choose your LLM provider: + +```sh +# OpenAI (default) +export OPENAI_API_KEY="your-openai-key" + +# Or use Anthropic Claude +export ANTHROPIC_API_KEY="your-anthropic-key" +``` + +## 2. Create Your Project + +Create a complete project with a single command using `uvx` (no installation required): + +```sh +uvx ragas quickstart rag_eval +cd rag_eval +``` + +That's it! You now have a fully configured evaluation project ready to use. + +## Project Structure + +Your generated project includes: + +```sh +rag_eval/ +├── README.md # Project documentation +├── evals.py # Evaluation configuration +├── rag.py # Your LLM application +└── evals/ + ├── datasets/ # Test data (CSV/JSON files) + ├── experiments/ # Evaluation results + └── logs/ # Execution logs +``` + +## Run Evaluations + +### Run the Evaluation + +Execute the evaluation on your dataset: + +```sh +uvx ragas evals evals.py --dataset test_data --metrics faithfulness,answer_correctness +``` + +Or, if you prefer to use Python directly (after installing ragas): + +```sh +python evals.py +``` + +This will: +- Load test data from `evals/datasets/` +- Evaluate your application using pre-configured metrics +- Save results to `evals/experiments/` + +### View Results + +Results are saved as CSV files in `evals/experiments/`: + +```python +import pandas as pd + +# Load and view results +df = pd.read_csv('evals/experiments/results.csv') +print(df[['user_input', 'response', 'faithfulness', 'answer_correctness']]) + +# Quick statistics +print(f"Average Faithfulness: {df['faithfulness'].mean():.2f}") +print(f"Average Correctness: {df['answer_correctness'].mean():.2f}") +``` + +... + +## Customize Your Evaluation + +### Add More Test Cases + +Edit `evals/datasets/test_data.csv`: + +```csv +user_input,response,reference +What is Ragas?,Ragas is an evaluation framework for LLM applications,Ragas provides objective metrics for evaluating LLM applications +How do metrics work?,Metrics score your LLM outputs,Metrics evaluate the quality and performance of LLM responses +``` + +### Change the LLM Provider + +In `evals.py`, update the LLM configuration: + +```python +from ragas.llms import llm_factory + +# Use Anthropic Claude +llm = llm_factory("claude-3-5-sonnet-20241022", provider="anthropic") + +# Use Google Gemini +llm = llm_factory("gemini-1.5-pro", provider="google") + +# Use local Ollama +llm = llm_factory("mistral", provider="ollama", base_url="http://localhost:11434") +``` + +### Select Different Metrics + +In `evals.py`, modify the metrics list: + +```python +from ragas.metrics import ( + Faithfulness, # Does response match context? + AnswerCorrectness, # Is the answer correct? + ContextPrecision, # Is retrieved context relevant? + ContextRecall, # Is all needed context retrieved? +) + +# Use only specific metrics +metrics = [ + Faithfulness(), + AnswerCorrectness(), +] +``` + +## What's Next? 
+ +- **Learn the concepts**: Read the [Evaluate a Simple LLM Application](evals.md) guide for deeper understanding +- **Custom metrics**: [Write your own metrics](../howtos/customizations/metrics/_write_your_own_metric.md) tailored to your use case +- **Production integration**: [Integrate evaluations into your CI/CD pipeline](../howtos/index.md) +- **RAG evaluation**: Evaluate [RAG systems](rag_eval.md) with specialized metrics +- **Agent evaluation**: Explore [AI agent evaluation](../howtos/applications/text2sql.md) +- **Test data generation**: [Generate synthetic test datasets](rag_testset_generation.md) for your evaluations + +## Getting Help + +- 📚 [Full Documentation](https://docs.ragas.io/) +- 💬 [Join our Discord Community](https://discord.gg/5djav8GGNZ) +- 🐛 [Report Issues](https://github.com/explodinggradients/ragas/issues) diff --git a/docs/howtos/applications/align-llm-as-judge.md b/docs/howtos/applications/align-llm-as-judge.md index 6ef8edd46..dc3865a70 100644 --- a/docs/howtos/applications/align-llm-as-judge.md +++ b/docs/howtos/applications/align-llm-as-judge.md @@ -185,7 +185,7 @@ async def judge_experiment( ```python import os from openai import AsyncOpenAI -from ragas.llms import instructor_llm_factory +from ragas.llms import llm_factory from ragas_examples.judge_alignment import load_dataset # Load dataset @@ -194,7 +194,7 @@ print(f"Dataset loaded with {len(dataset)} samples") # Initialize LLM client openai_client = AsyncOpenAI(api_key=os.environ.get("OPENAI_API_KEY")) -llm = instructor_llm_factory("openai", model="gpt-4o-mini", client=openai_client) +llm = llm_factory("gpt-4o-mini", client=openai_client) # Run the experiment results = await judge_experiment.arun( diff --git a/docs/howtos/integrations/_haystack.md b/docs/howtos/integrations/_haystack.md index ba99746bb..be4435f45 100644 --- a/docs/howtos/integrations/_haystack.md +++ b/docs/howtos/integrations/_haystack.md @@ -50,7 +50,7 @@ docs = [Document(content=doc) for doc in dataset] ```python -from haystack.components.embedders import OpenAITextEmbedder, OpenAIDocumentEmbedder +from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder document_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small") text_embedder = OpenAITextEmbedder(model="text-embedding-3-small") @@ -133,8 +133,8 @@ Make sure to include all relevant data for each metric to ensure accurate evalua ```python from haystack_integrations.components.evaluators.ragas import RagasEvaluator - from langchain_openai import ChatOpenAI + from ragas.llms import LangchainLLMWrapper from ragas.metrics import AnswerRelevancy, ContextPrecision, Faithfulness @@ -252,7 +252,7 @@ In the example below, we will define two custom Ragas metrics: ```python -from ragas.metrics import RubricsScore, AspectCritic +from ragas.metrics import AspectCritic, RubricsScore SportsRelevanceMetric = AspectCritic( name="sports_relevance_metric", diff --git a/docs/howtos/integrations/_helicone.md b/docs/howtos/integrations/_helicone.md index 318f2d80b..e4e781577 100644 --- a/docs/howtos/integrations/_helicone.md +++ b/docs/howtos/integrations/_helicone.md @@ -25,16 +25,18 @@ First, let's install the required packages and set up our environment. 
```python import os + from datasets import Dataset + from ragas import evaluate -from ragas.metrics import faithfulness, answer_relevancy, context_precision from ragas.integrations.helicone import helicone_config # import helicone_config - +from ragas.metrics import answer_relevancy, context_precision, faithfulness # Set up Helicone -helicone_config.api_key = ( +HELICONE_API_KEY = ( "your_helicone_api_key_here" # Replace with your actual Helicone API key ) +helicone_config.api_key = HELICONE_API_KEY os.environ["OPENAI_API_KEY"] = ( "your_openai_api_key_here" # Replace with your actual OpenAI API key ) diff --git a/docs/howtos/integrations/_langchain.md b/docs/howtos/integrations/_langchain.md index 0a31b98cf..475565c0b 100644 --- a/docs/howtos/integrations/_langchain.md +++ b/docs/howtos/integrations/_langchain.md @@ -13,8 +13,9 @@ With this integration you can easily evaluate your QA chains with the metrics of ```python # attach to the existing event loop when using jupyter notebooks -import nest_asyncio import os + +import nest_asyncio import openai from dotenv import load_dotenv @@ -35,9 +36,9 @@ First lets load the dataset. We are going to build a generic QA system over the ```python -from langchain_community.document_loaders import TextLoader -from langchain.indexes import VectorstoreIndexCreator from langchain.chains import RetrievalQA +from langchain.indexes import VectorstoreIndexCreator +from langchain_community.document_loaders import TextLoader from langchain_openai import ChatOpenAI loader = TextLoader("./nyc_wikipedia/nyc_text.txt") @@ -155,10 +156,10 @@ result["result"] ```python from ragas.langchain.evalchain import RagasEvaluatorChain from ragas.metrics import ( - faithfulness, answer_relevancy, context_precision, context_recall, + faithfulness, ) # create evaluation chains diff --git a/docs/howtos/integrations/_langsmith.md b/docs/howtos/integrations/_langsmith.md index d936c1f43..cedbe71e3 100644 --- a/docs/howtos/integrations/_langsmith.md +++ b/docs/howtos/integrations/_langsmith.md @@ -26,9 +26,9 @@ Once langsmith is setup, just run the evaluations as your normally would ```python from datasets import load_dataset -from ragas.metrics import context_precision, answer_relevancy, faithfulness -from ragas import evaluate +from ragas import evaluate +from ragas.metrics import answer_relevancy, context_precision, faithfulness fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval") diff --git a/docs/howtos/integrations/_zeno.md b/docs/howtos/integrations/_zeno.md index ebf2a6ca0..1386313f9 100644 --- a/docs/howtos/integrations/_zeno.md +++ b/docs/howtos/integrations/_zeno.md @@ -13,7 +13,7 @@ pip install zeno-client Next, create an account at [hub.zenoml.com](https://hub.zenoml.com) and generate an API key on your [account page](https://hub.zenoml.com/account). 
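+
+With the key generated, you can authenticate the client. A minimal sketch, assuming the key is exported as a `ZENO_API_KEY` environment variable (adapt this to however you store secrets):
+
+```python
+import os
+
+from zeno_client import ZenoClient
+
+# Authenticate against Zeno Hub with the API key created above
+client = ZenoClient(os.environ["ZENO_API_KEY"])
+```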
-We can now pick up the evaluation where we left off at the [Getting Started](./../../getstarted/index.md) guide: +We can now pick up the evaluation where we left off at the [Getting Started](../../getstarted/evaluation.md) guide: ```python @@ -21,6 +21,8 @@ import os import pandas as pd from datasets import load_dataset +from zeno_client import ZenoClient, ZenoMetric + from ragas import evaluate from ragas.metrics import ( answer_relevancy, @@ -28,7 +30,6 @@ from ragas.metrics import ( context_recall, faithfulness, ) -from zeno_client import ZenoClient, ZenoMetric ``` diff --git a/mkdocs.yml b/mkdocs.yml index 61883d0ce..ff3cc9408 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -11,6 +11,7 @@ nav: - "": index.md - 🚀 Get Started: - getstarted/index.md + # - Quick Start: getstarted/quickstart.md - Installation: getstarted/install.md - Evaluate your first LLM App: getstarted/evals.md - Evaluate a simple RAG: getstarted/rag_eval.md @@ -254,7 +255,8 @@ extra: provider: google property: !ENV GOOGLE_ANALYTICS_KEY plugins: - - social + - social: + enabled: !ENV [MKDOCS_CI, true] - search - git-revision-date-localized: enabled: !ENV [MKDOCS_CI, false]