docs: comprehensive documentation for NotebookLM-RAG integration

Update documentation to reflect new integration features:

README.md:
- Add 'Integrazione NotebookLM + RAG' section after Overview
- Update DocuMente component section with new endpoints
- Add notebooklm_sync.py and notebooklm_indexer.py to architecture
- Add integration API examples
- Add link to docs/integration.md

SKILL.md:
- Add RAG Integration to Capabilities table
- Update Autonomy Rules with new endpoints
- Add RAG Integration section to Quick Reference
- Add Sprint 2 changelog with integration features
- Update Skill Version to 1.2.0

docs/integration.md (NEW):
- Complete integration guide with architecture diagram
- API reference for all sync and query endpoints
- Usage examples and workflows
- Best practices and troubleshooting
- Performance considerations and limitations
- Roadmap for future features

All documentation now accurately reflects the unified NotebookLM + RAG agent capabilities.
2026-04-06 18:01:50 +02:00
parent a5029aef20
commit 568489cae4
3 changed files with 628 additions and 3 deletions
@@ -0,0 +1,436 @@
+# NotebookLM + RAG Integration Guide
+
+This document describes the integration between **NotebookLM Agent** and **DocuMente RAG**, which makes it possible to run semantic (RAG) searches over the contents of Google NotebookLM notebooks.
+
+---
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Architecture](#architecture)
+- [How It Works](#how-it-works)
+- [API Reference](#api-reference)
+- [Usage Examples](#usage-examples)
+- [Best Practices](#best-practices)
+- [Troubleshooting](#troubleshooting)
+
+---
+
+## Overview
+
+The integration bridges the gap between **notebook management** (NotebookLM Agent) and **semantic search** (DocuMente RAG), letting you:
+
+- 🔍 **Search** notebook contents with semantic search
+- 🧠 **Use multi-provider LLMs** to query notebooks
+- 📊 **Combine** notebooks and local documents in the same queries
+- 🎯 **Filter** by specific notebooks
+- ⚡ **Index** content automatically
+
+### Use Cases
+
+1. **Research Assistant**: "What do all my notebooks say about artificial intelligence?"
+2. **Knowledge Mining**: "Find all the sources that discuss Python in my programming notebooks"
+3. **Cross-Notebook Analysis**: "Compare the conclusions of notebook A and notebook B"
+4. **Document + Notebook Search**: "What information do I have in both my PDF documents and my notebooks?"
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│                        NotebookLM Agent                         │
+│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐ │
+│  │  Notebooks  │───▶│   Sources   │───▶│   Full Text Get     │ │
+│  └─────────────┘    └─────────────┘    └─────────────────────┘ │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              │ Extract Content
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                   NotebookLMIndexerService                      │
+│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐ │
+│  │   Chunking  │───▶│  Embedding  │───▶│   Metadata Store    │ │
+│  └─────────────┘    └─────────────┘    └─────────────────────┘ │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              │ Index to Vector Store
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                         Qdrant Vector Store                     │
+│  ┌───────────────────────────────────────────────────────────┐  │
+│  │  Collection: "documents"                                  │  │
+│  │  Points with metadata:                                    │  │
+│  │    - notebook_id, source_id, source_title                 │  │
+│  │    - notebook_title, source_type                          │  │
+│  │    - source: "notebooklm"                                 │  │
+│  └───────────────────────────────────────────────────────────┘  │
+└─────────────────────────────────────────────────────────────────┘
+                              │
+                              │ Query with Filters
+                              ▼
+┌─────────────────────────────────────────────────────────────────┐
+│                          RAGService                             │
+│  ┌─────────────┐    ┌─────────────┐    ┌─────────────────────┐ │
+│  │    Query    │───▶│   Search    │───▶│   LLM Generation    │ │
+│  └─────────────┘    └─────────────┘    └─────────────────────┘ │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+---
+
+## How It Works
+
+### 1. Synchronization
+
+When you sync a notebook:
+
+1. **Extraction**: Fetches all sources from the notebook via `notebooklm-py`
+2. **Full Text**: Retrieves the full text of each source (if available)
+3. **Chunking**: Splits the content into chunks of ~1024 characters
+4. **Embedding**: Generates vector embeddings using OpenAI
+5. **Storage**: Saves them to Qdrant with full metadata
+
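The chunking step above can be sketched as follows. This is a minimal illustration, not the indexer's actual code: the function name `chunk_text`, the fixed 1024-character window, and the absence of overlap between chunks are all assumptions, and the real `NotebookLMIndexerService` may split differently (for example on sentence boundaries).

```python
def chunk_text(text: str, chunk_size: int = 1024) -> list[str]:
    """Split text into consecutive pieces of at most chunk_size characters.

    Hypothetical helper: the real indexer may use overlap or smarter boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


if __name__ == "__main__":
    chunks = chunk_text("a" * 2500)
    print([len(c) for c in chunks])  # [1024, 1024, 452]
```

Under these assumptions a 2500-character source yields three chunks, the last one shorter than the others.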
+### 2. Metadata Structure
+
+Each stored chunk contains:
+
+```json
+{
+  "text": "chunk content...",
+  "notebook_id": "notebook-uuid",
+  "source_id": "source-uuid",
+  "source_title": "Source Title",
+  "source_type": "url|file|youtube|drive",
+  "notebook_title": "Notebook Title",
+  "source": "notebooklm"
+}
+```
+
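To make the payload shape concrete, the sketch below builds one such record in Python. `build_payload` and `ALLOWED_SOURCE_TYPES` are hypothetical names used only for illustration. The field names come from the JSON above, and the check on `source_type` is an assumption about how one might guard against unexpected values, not documented behavior.

```python
ALLOWED_SOURCE_TYPES = {"url", "file", "youtube", "drive"}  # values listed in the example above


def build_payload(text, notebook_id, notebook_title, source_id, source_title, source_type):
    """Return the metadata dict stored alongside one chunk (hypothetical helper)."""
    if source_type not in ALLOWED_SOURCE_TYPES:
        raise ValueError(f"unexpected source_type: {source_type!r}")
    return {
        "text": text,
        "notebook_id": notebook_id,
        "source_id": source_id,
        "source_title": source_title,
        "source_type": source_type,
        "notebook_title": notebook_title,
        "source": "notebooklm",  # marks the chunk as coming from NotebookLM
    }
```

The constant `source` field is what would let a query tell notebook chunks apart from local documents stored in the same collection.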
+### 3. Query
+
+When you run a query:
+
+1. **Embedding**: The question is converted into an embedding
+2. **Search**: Qdrant finds the most similar chunks
+3. **Filter**: If any `notebook_id` values are specified, results are filtered by them
+4. **Context**: The chunks are formatted as context
+5. **Generation**: The LLM generates the answer based on that context
+
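The search and filter steps can be illustrated without a running Qdrant instance. The sketch below is a pure-Python stand-in: it ranks in-memory points by cosine similarity and applies an optional `notebook_id` filter before ranking, so that up to `k` results are still returned. The names `cosine` and `search`, and the point layout (`vector` plus `payload`), are assumptions for illustration; in the real service Qdrant performs this step.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(points, query_vector, k=5, notebook_ids=None):
    """Return the payloads of the k points most similar to query_vector.

    points: list of {"vector": [...], "payload": {...}} dicts (hypothetical layout).
    notebook_ids: optional collection of IDs; points from other notebooks are skipped.
    """
    candidates = [
        p for p in points
        if notebook_ids is None or p["payload"].get("notebook_id") in notebook_ids
    ]
    ranked = sorted(candidates, key=lambda p: cosine(p["vector"], query_vector), reverse=True)
    return [p["payload"] for p in ranked[:k]]
```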
+---
+
+## API Reference
+
+### Sync Endpoints
+
+#### POST `/api/v1/notebooklm/sync/{notebook_id}`
+Syncs a notebook from NotebookLM to the vector store.
+
+**Response:**
+```json
+{
+  "sync_id": "sync-uuid",
+  "notebook_id": "notebook-uuid",
+  "notebook_title": "Notebook Title",
+  "status": "success",
+  "sources_indexed": 5,
+  "total_chunks": 42,
+  "message": "Successfully synced 5 sources with 42 chunks"
+}
+```
+
+#### GET `/api/v1/notebooklm/indexed`
+Lists all synced notebooks.
+
+**Response:**
+```json
+{
+  "notebooks": [
+    {
+      "notebook_id": "uuid-1",
+      "notebook_title": "AI Research",
+      "sources_count": 10,
+      "chunks_count": 150,
+      "last_sync": "2026-01-15T10:30:00Z"
+    }
+  ],
+  "total": 1
+}
+```
+
+#### DELETE `/api/v1/notebooklm/sync/{notebook_id}`
+Removes a notebook from the vector store.
+
+**Response:**
+```json
+{
+  "notebook_id": "notebook-uuid",
+  "deleted": true,
+  "message": "Successfully removed index..."
+}
+```
+
+#### GET `/api/v1/notebooklm/sync/{notebook_id}/status`
+Checks the sync status of a notebook.
+
+**Response:**
+```json
+{
+  "notebook_id": "notebook-uuid",
+  "status": "indexed",
+  "sources_count": 5,
+  "chunks_count": 42,
+  "last_sync": "2026-01-15T10:30:00Z"
+}
+```
+
+### Query Endpoints
+
+#### POST `/api/v1/query` (with notebook filter)
+Runs a RAG query, optionally filtered by notebook.
+
+**Request:**
+```json
+{
+  "question": "What are the key points?",
+  "notebook_ids": ["uuid-1", "uuid-2"],
+  "include_documents": true,
+  "k": 10,
+  "provider": "openai",
+  "model": "gpt-4o"
+}
+```
+
+**Response:**
+```json
+{
+  "question": "What are the key points?",
+  "answer": "According to the documents and notebooks analyzed...",
+  "provider": "openai",
+  "model": "gpt-4o",
+  "sources": [
+    {
+      "text": "Chunk content...",
+      "source_type": "notebooklm",
+      "notebook_id": "uuid-1",
+      "notebook_title": "AI Research",
+      "source_title": "Introduction to AI"
+    }
+  ],
+  "user": "anonymous",
+  "filters_applied": {
+    "notebook_ids": ["uuid-1", "uuid-2"],
+    "include_documents": true
+  }
+}
+```
+
+#### POST `/api/v1/query/notebooks`
+Runs a query **only** on notebooks (local documents are excluded).
+
+**Request:**
+```json
+{
+  "question": "Find information about...",
+  "notebook_ids": ["uuid-1"],
+  "k": 10,
+  "provider": "anthropic"
+}
+```
+
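A request like the one above could be issued from Python using only the standard library. This is a sketch under assumptions: `BASE_URL` matches the `localhost:8000` address used in this guide's curl examples, `build_notebook_query` and `query_notebooks` are hypothetical helper names, and no authentication header is sent because the guide does not describe one.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local deployment


def build_notebook_query(question, notebook_ids, k=10, provider="anthropic"):
    """Build (without sending) a POST request for /api/v1/query/notebooks."""
    body = {"question": question, "notebook_ids": notebook_ids, "k": k, "provider": provider}
    return urllib.request.Request(
        f"{BASE_URL}/api/v1/query/notebooks",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def query_notebooks(question, notebook_ids, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_notebook_query(question, notebook_ids, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```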
+---
+
+## Usage Examples
+
+### Example 1: Sync and Basic Query
+
+```bash
+# 1. Sync a notebook
+curl -X POST http://localhost:8000/api/v1/notebooklm/sync/abc-123
+
+# 2. Query the synced notebook
+curl -X POST http://localhost:8000/api/v1/query/notebooks \
+  -H "Content-Type: application/json" \
+  -d '{
+    "question": "Which AI technologies are mentioned?",
+    "notebook_ids": ["abc-123"]
+  }'
+```
+
+### Example 2: Multi-Notebook Search
+
+```bash
+# Query several notebooks at once
+curl -X POST http://localhost:8000/api/v1/query \
+  -H "Content-Type: application/json" \
+  -d '{
+    "question": "Compare the machine learning approaches described",
+    "notebook_ids": ["notebook-1", "notebook-2", "notebook-3"],
+    "k": 15,
+    "provider": "anthropic"
+  }'
+```
+
+### Example 3: Full Workflow
+
+```bash
+#!/bin/bash
+
+# 1. Get the list of notebooks from NotebookLM
+NOTEBOOKS=$(curl -s http://localhost:8000/api/v1/notebooks)
+
+# 2. Sync the first notebook
+NOTEBOOK_ID=$(echo "$NOTEBOOKS" | jq -r '.data.items[0].id')
+echo "Syncing notebook: $NOTEBOOK_ID"
+
+SYNC_RESULT=$(curl -s -X POST "http://localhost:8000/api/v1/notebooklm/sync/$NOTEBOOK_ID")
+echo "Result: $SYNC_RESULT"
+
+# 3. Wait for the sync to complete (if it is asynchronous)
+sleep 2
+
+# 4. Query the notebook
+curl -X POST http://localhost:8000/api/v1/query/notebooks \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"question\": \"Summarize the main points\",
+    \"notebook_ids\": [\"$NOTEBOOK_ID\"],
+    \"provider\": \"openai\"
+  }"
+```
+
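The fixed `sleep 2` in the workflow is a guess at how long indexing takes. If synchronization turns out to be asynchronous, polling the status endpoint is more robust. The sketch below assumes the endpoint reports `"status": "indexed"` once the sync is done, as in the status response shown in the API reference. `wait_until_indexed` and the injected `get_status` callable are hypothetical names, chosen so the loop can be tried without a running server.

```python
import time


def wait_until_indexed(get_status, notebook_id, timeout=60.0, interval=2.0, sleep=time.sleep):
    """Poll get_status(notebook_id) until it reports "indexed" or the timeout expires.

    get_status: callable returning the parsed JSON of
        GET /api/v1/notebooklm/sync/{notebook_id}/status (hypothetical wrapper).
    Returns True if the notebook became indexed, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_status(notebook_id).get("status") == "indexed":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```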
+---
+
+## Best Practices
+
+### 1. **Selective Synchronization**
+Don't sync every notebook; sync only the ones relevant to your searches.
+
+```bash
+# Sync only the active notebooks
+for notebook_id in "notebook-1" "notebook-2"; do
+  curl -X POST "http://localhost:8000/api/v1/notebooklm/sync/$notebook_id"
+done
+```
+
+### 2. **Chunk Management**
+Each source is split into chunks of ~1024 characters. If a notebook has many large sources, consider:
+- Increasing `k` in queries (default: 5, max: 50)
+- Filtering by specific notebooks to reduce the context
+
+### 3. **Provider Selection**
+Use different providers for different kinds of queries:
+- **OpenAI GPT-4o**: Complex queries, detailed analysis
+- **Anthropic Claude**: Long summaries, text analysis
+- **Mistral**: Fast queries, concise answers
+
+### 4. **Periodic Refresh**
+Notebooks change over time. Consider:
+- Removing and re-syncing them periodically
+- Adding a scheduled job for the refresh
+
+```bash
+# Cron job for a weekly refresh (Sundays at 02:00)
+0 2 * * 0 /path/to/sync-notebooks.sh
+```
+
+### 5. **Monitoring**
+Track which notebooks are synced:
+
+```bash
+# List synced notebooks and check their status
+curl http://localhost:8000/api/v1/notebooklm/indexed | jq '.'
+```
+
+---
+
+## Troubleshooting
+
+### Problem: Sync failed
+
+**Symptoms**: 500 error during synchronization
+
+**Cause**: NotebookLM may not have the full text available for some sources
+
+**Solution**:
+1. Verify that the notebook exists: `GET /api/v1/notebooks/{id}`
+2. Check that the sources are indexed: NotebookLM shows "Ready"
+3. Some sources (YouTube, Drive) may not have extracted text
+
+### Problem: Query returns no results
+
+**Symptoms**: Response "I don't have enough information..."
+
+**Check**:
+```bash
+# 1. Is the notebook synced?
+curl http://localhost:8000/api/v1/notebooklm/sync/{notebook_id}/status
+
+# 2. How many chunks are there?
+curl http://localhost:8000/api/v1/notebooklm/indexed
+```
+
+**Solution**:
+- Increase `k` in the query
+- Verify that the content was actually extracted
+- Check that the embedding model is configured correctly
+
+### Problem: Rate Limiting
+
+**Symptoms**: 429 errors during synchronization
+
+**Solution**:
+- NotebookLM has aggressive rate limits
+- Add a delay between syncs
+- Sync during low-traffic hours
+
+```python
+# Add a delay between syncs
+import asyncio
+
+async def sync_all(notebook_ids):
+    # sync_notebook is your own coroutine that calls the sync endpoint
+    for notebook_id in notebook_ids:
+        await sync_notebook(notebook_id)
+        await asyncio.sleep(5)  # Wait 5 seconds
+```
+
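A fixed delay can be complemented by retrying with exponential backoff when a 429 is returned. The sketch below is one possible approach and is not part of the project: `sync_with_backoff` and the injected `do_sync` callable (assumed to return an HTTP status code) are hypothetical, and the delay values are arbitrary starting points.

```python
import time


def sync_with_backoff(do_sync, notebook_id, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call do_sync(notebook_id), retrying on HTTP 429 with doubling delays.

    do_sync: callable returning an HTTP status code
        (hypothetical wrapper around the sync endpoint).
    Returns the last status code seen.
    """
    status = None
    for attempt in range(max_attempts):
        status = do_sync(notebook_id)
        if status != 429:
            break
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status
```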
+---
+
+## Performance Considerations
+
+### Chunk Size
+- **Default**: 1024 characters
+- **Trade-off**:
+  - Larger chunks = more context but less precision
+  - Smaller chunks = more precision but less context
+
+### Number of Notebooks
+- **Recommended**: < 50 notebooks synced at the same time
+- **Optimal**: Filter by specific notebooks in queries
+
+### Refresh Strategy
+- **Full Refresh**: Remove everything and re-sync (slow but clean)
+- **Incremental**: Add only new sources (faster but may produce duplicates)
+
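One way to keep an incremental refresh from creating duplicates is to derive each point ID deterministically from the chunk's origin, so that re-indexing the same chunk overwrites the existing point instead of adding a new one. The sketch below uses `uuid.uuid5` for this. The `chunk_point_id` helper, the namespace string, and the key layout are assumptions, and the approach relies on the vector store upserting by ID. It does not remove stale chunks when a source becomes shorter, so a delete step would still be needed in that case.

```python
import uuid

# Fixed namespace so the same inputs always map to the same ID
# (the name string is an arbitrary, hypothetical choice).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "notebooklm-rag-chunks")


def chunk_point_id(notebook_id: str, source_id: str, chunk_index: int) -> str:
    """Return a stable UUID string for one chunk of one source."""
    return str(uuid.uuid5(NAMESPACE, f"{notebook_id}/{source_id}/{chunk_index}"))
```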
+---
+
+## Known Limitations
+
+1. **Full Text**: Not all NotebookLM sources have full text available (e.g. some PDFs, YouTube)
+2. **No Automatic Sync**: Synchronization is triggered manually via the API, not automatically
+3. **Storage**: Chunks duplicate storage (content lives in both NotebookLM and Qdrant)
+4. **Embedding Model**: Currently uses OpenAI for embeddings (to be made configurable in the future)
+
+---
+
+## Roadmap
+
+- [ ] **Auto-Sync**: Automatic synchronization when notebooks change
+- [ ] **Incremental Sync**: Update only the modified sources
+- [ ] **Multi-Embedder**: Support for other embedding models
+- [ ] **Semantic Chunking**: Chunking based on meaning rather than length
+- [ ] **Cross-Reference**: Links between similar sources in different notebooks
+
+---
+
+**Version**: 1.0.0  
+**Last Updated**: 2026-04-06