Thanks for your interest in our solution. Having specific examples of replication and cloning allows us to continue to grow and scale our work. If you clone or download this repository, kindly shoot us a quick email to let us know you are interested in this work!
Customers are responsible for making their own independent assessment of the information in this document.
This document:
(a) is for informational purposes only,
(b) represents current AWS product offerings and practices, which are subject to change without notice,
(c) does not create any commitments or assurances from AWS and its affiliates, suppliers, or licensors, and
(d) is not to be considered a recommendation or viewpoint of AWS.
AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.
Additionally, all prototype code and associated assets should be considered:
(a) as-is and without warranties,
(b) not suitable for production environments, and
(c) to include shortcuts in order to support rapid prototyping, such as, but not limited to, relaxed authentication and authorization and a lack of strict adherence to security best practices.
All work produced is open source. More information can be found in the GitHub repo.
---
The DxHub developed a chatbot solution that answers user questions by drawing on their knowledge base articles. The chatbot includes the following features:
- Leverages Retrieval Augmented Generation (RAG) for accurate and contextual responses
- Dynamic context integration for more relevant and precise answers
- Real-time information retrieval
- Direct links to source documents
- Easy access to reference materials
- Serverless architecture enables automatic scaling
- API-first design supports multiple frontend implementations
- AWS CDK CLI
- Docker (running)
- Python 3.x
- Git
- A CDK-bootstrapped environment
- AWS credentials configured
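If you want to sanity-check these prerequisites before deploying, a quick sketch (assumes a POSIX shell and that your default AWS profile points at the target account):
# Hedged verification sketch; adjust to your environment
aws sts get-caller-identity        # confirms AWS credentials are configured
cdk --version                      # confirms the AWS CDK CLI is installed
docker info > /dev/null && echo "Docker is running"
python3 --version
git --version
# If the target account/region has never been bootstrapped for CDK:
cdk bootstrap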
git clone https://github.com/cal-poly-dxhub/internet2-chatbot.git
cd internet2-chatbot
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
mv example_config.yaml config.yaml
In the Amazon Bedrock console, go to Model access and request access for:
anthropic.claude-3-5-sonnet-20241022-v2:0
amazon.titan-embed-text-v2:0
anthropic.claude-3-haiku-20240307-v1:0
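Model access is granted through the console, but as a rough check you can list the relevant model IDs visible to your account from the CLI (assumes the AWS CLI is current and your region supports Bedrock; listing a model does not by itself confirm access has been granted):
# Optional check: list Claude and Titan embedding model IDs in this region
aws bedrock list-foundation-models \
  --query "modelSummaries[?contains(modelId, 'claude') || contains(modelId, 'titan-embed')].modelId" \
  --output table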
cdk deploy
Update config.yaml with the CDK outputs:
opensearch_endpoint: <FROM_CDK_OUTPUT>
rag_api_endpoint: <FROM_CDK_OUTPUT>
s3_bucket_name: <FROM_CDK_OUTPUT>
step_function_arn: <FROM_CDK_OUTPUT>
processed_files_table: <FROM_CDK_OUTPUT>
api_key: <FROM_API_GATEWAY_CONSOLE>
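The CDK output does not print the API key value itself. Besides copying it from the API Gateway console, one possible way to pull it from the CLI is sketched below (it lists every key in the account, so pick the one created by this stack):
# Sketch: list API keys with their values and copy the one for this stack
aws apigateway get-api-keys --include-values \
  --query 'items[].{name:name,value:value}' --output table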
# Upload files to S3
aws s3 cp your-documents/ s3://<S3_BUCKET_NAME>/files-to-process/ --recursive
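To confirm the objects landed in the ingestion prefix, you can list it:
# Verify the upload (bucket name from the CDK output)
aws s3 ls s3://<S3_BUCKET_NAME>/files-to-process/ --recursive --human-readable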
Prerequisites: Google Cloud account, Atlassian account, LibreOffice installed
1. Set Up Atlassian API Access:
- Go to https://id.atlassian.com/manage-profile/security/api-tokens
- Click "Create API token", label it, and copy the token
- Save this token for later use
2. Set Up Google Service Account:
- Go to Google Cloud Console
- Create a new project (or use existing)
- Go to "APIs & Services" > "Library" and enable "Google Drive API"
- Go to "APIs & Services" > "Credentials"
- Click "Create Credentials" > "Service account"
- Give it a name and click "Create and Continue" > "Done"
- Click your new service account > "Keys" tab > "Add Key" > "Create new key" > "JSON"
- Download and save the JSON file (CRITICAL: Make sure the downloaded file is in a valid .json format)
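If you prefer the command line, the same service-account setup can be sketched with gcloud (assumes gcloud is installed and authenticated; PROJECT_ID and the chatbot-ingest account name are placeholders, not names used by this project):
# Sketch of the console steps above, done with gcloud
gcloud config set project PROJECT_ID
gcloud services enable drive.googleapis.com
gcloud iam service-accounts create chatbot-ingest --display-name "Chatbot ingestion"
gcloud iam service-accounts keys create service-account.json \
  --iam-account chatbot-ingest@PROJECT_ID.iam.gserviceaccount.com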
3. Share Google Drive Access:
- Open Google Drive and right-click folders/files you want to ingest
- Click "Share" and add the service account email (from JSON file:
xxxx@xxxx.iam.gserviceaccount.com
) - Set as "Viewer" and click "Send"
4. Install LibreOffice:
# macOS
brew install --cask libreoffice
# Ubuntu/Debian
sudo apt-get install libreoffice
5. Configure Environment:
# Go into the code directory
cd ingest_utils
cd confluence_processor
# Set these variables in names.env:
SERVICE_ACC_SECRET_NAME=default-service-account-name
GOOGLE_DRIVE_CREDENTIALS=/path/to/your/service-account.json
GOOGLE_API_KEY=your-google-api-key
CONFLUENCE_API=your_atlassian_api_token_from_step_1
# Load environment variables
source names.env
# Set these variables in config.yaml and check for any missing fields
CONFLUENCE_URL=your-confluence-url-to-scrape
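Put together, a names.env might look like the sketch below. Whether the scripts need the variables exported or only assigned depends on the code, so treat the export prefix as an assumption; all values are placeholders.
# Illustrative names.env (placeholder values); `export` makes the variables
# visible to the Python scripts after `source names.env`
export SERVICE_ACC_SECRET_NAME=default-service-account-name
export GOOGLE_DRIVE_CREDENTIALS=/path/to/your/service-account.json
export GOOGLE_API_KEY=your-google-api-key
export CONFLUENCE_API=your_atlassian_api_token_from_step_1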
6. Run Ingestion:
# Scrape asset links from Confluence wiki (creates a .csv file with links to folders)
python confluence_processor.py
# Download from Google Drive and upload to S3
python google_drive_processor.py
# Download .txt files from the wiki page descriptions section and upload to S3
python confluence_event_descriptions_to_s3.py
# Make sure you are in the ingest_utils directory
cd ingest_utils
python run_step_function.py
This automatically creates the OpenSearch index if needed, then starts document processing.
Note: Files set for processing are saved to DynamoDB to ensure there are no lost updates due to concurrent operations. To reset this cache, run:
python run_step_function.py --reset-cache
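To watch the ingestion run, one option is to poll the state machine from the CLI (the ARN is the same step_function_arn recorded in config.yaml):
# Sketch: check the most recent executions of the ingestion state machine
aws stepfunctions list-executions \
  --state-machine-arn <STEP_FUNCTION_ARN> \
  --max-results 5 \
  --query 'executions[].{name:name,status:status,started:startDate}' \
  --output table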
Command line interface:
python chat_test.py
OR
Web interface:
streamlit run chat_frontend.py
Note: You can start testing immediately after Step 6, but response quality will improve as more documents are processed. Wait for full ingestion completion for best results.
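Both clients talk to the same REST API, so you can also exercise the endpoint directly. The payload shape below is illustrative only; the actual request schema is defined by the deployed API, and the x-api-key header assumes a standard API Gateway key:
# Illustrative request only; adjust the body to match the deployed API
curl -X POST "<RAG_API_ENDPOINT>" \
  -H "Content-Type: application/json" \
  -H "x-api-key: <API_KEY>" \
  -d '{"query": "How do I reset my password?"}'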
- Docker access:
sudo usermod -aG docker $USER && newgrp docker
- CDK issues: Check
aws sts get-caller-identity
and run cdk bootstrap
- Model access: Verify in Bedrock console
- Processing fails: Check Step Function logs in AWS Console
- Chat issues: Verify API key and endpoint accessibility
- Quick PoC with no intent verification or error checking
For queries or issues:
- Darren Kraker, Sr Solutions Architect - dkraker@amazon.com
- Nick Riley, Jr SDE - njriley@amazon.edu
- Kartik Malunjkar, Software Development Engineer Intern - kmalunjk@calpoly.edu