
4 posts tagged with "python"


· 9 min read
Roy Firestein

How hard is it to build an AI scammer or a frontdesk assistant? Not hard at all.

AI research is progressing at a breakneck pace thanks to large investments in the field over the last decade and ever-increasing computational power. The demand for AI has exceeded initial expectations, with businesses and individuals alike relying on AI to make their daily tasks more efficient. New companies are emerging to capture business opportunities in the AI space. One such company is Groq, which is developing a new AI inference accelerator and promises the fastest AI inference at one of the lowest prices per 1M tokens.

Now, let's talk about chatting with AI in real-time. It's not as simple as it sounds. Imagine having a conversation where every reply comes after an awkward pause—it wouldn't be fun, or believable, right? For AI to keep up in a real chit-chat, it needs to snap back with answers in less than a blink. That's under 500 milliseconds, to be exact. Groq's hardware, combined with the right AI models, makes this kind of speedy banter possible.
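If you want to check whether your own pipeline stays inside that budget, it helps to time each stage. Here is a minimal sketch; the stage functions in the commented usage are placeholders for the speech-to-text, LLM, and text-to-speech calls we build later in this post.

import time

def timed(label, fn, *args):
    # Run one pipeline stage and report how long it took
    start = time.perf_counter()
    result = fn(*args)
    print(f"{label}: {(time.perf_counter() - start) * 1000:.0f} ms")
    return result

# Hypothetical usage once the stage functions exist:
# text  = timed("speech-to-text", transcribe_audio, "recording.wav")
# reply = timed("LLM", llm_chat, text, [], "Bot")
# timed("text-to-speech", play_speech, reply)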

Think about all the cool stuff we can do with real-time voice chat AI. It's not just about asking your phone for the weather; it's about revolutionizing customer service, creating new ways to interact with technology hands-free, and offering a helping hand to those in need. But let's not sugarcoat it—there's a flip side. Just as we can use AI for good, some will try to use it for scams and other shady stuff. In this post, we're taking a deep dive into the world of real-time AI voice chats, showing you the good, the bad, and how to get your hands dirty building it.

You can skip to the Demo Videos section to see it all working together.

Requirements

This project will require the following:

  • An Apple Silicon MacBook (M2) with at least 16 GB of RAM
  • Free API key from Groq
  • Free API key from Eleven Labs
  • Python 3.10 and other software dependencies

Lab Setup

We begin with installing the required software.

Install Homebrew on your Mac if you haven't already.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Using brew, install python@3.10, portaudio, and ffmpeg.

brew install python@3.10 portaudio ffmpeg
  • portaudio is required for the pyaudio package.
  • ffmpeg is required for the pydub package.

Now we can create the virtual environment and install the required Python packages. Virtual environments are used to isolate the dependencies of a project from the system's Python installation.

python3.10 -m venv venv

Activate the virtual environment.

source venv/bin/activate

Great. Now we can set up whisper.cpp, a C/C++ port of OpenAI's Whisper speech-to-text model. Running the model locally improves the AI's response time, and this port is optimized for Apple's CoreML framework to take advantage of the M2's neural engine.

Install the latest version from source:

pip install git+https://github.com/aarnphm/whispercpp.git -vv
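Before wiring everything together, you can sanity-check the install with a one-off transcription. This is a minimal sketch using the same calls we rely on later; sample.wav is a hypothetical short test recording.

from whispercpp import Whisper

# Downloads the tiny model on first run
w = Whisper('tiny')

# Transcribe a short WAV file and print the joined text segments
result = w.transcribe("sample.wav")
print(" ".join(w.extract_text(result)))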

Create a new file called chat.py and add the following code to import the required packages.

import os
import wave
from pydub import AudioSegment
from groq import Groq
from whispercpp import Whisper
from elevenlabs import generate, stream
import pyaudio

Define the required API keys:

# Set the API keys
os.environ["ELEVEN_API_KEY"] = "YOUR API KEY"
os.environ["GROQ_API_KEY"] = "YOUR API KEY"

Download and initialize the Whisper model:

# Initialize the Whisper client
whisper = Whisper('tiny')

We are using Groq instead of OpenAI because it is faster. Create the Groq client:

# Create API clients
groq_client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

Define the system prompt:

# Set the system prompt
SYSTEM_PROMPT = "\n".join([
    "You are a friendly hotel frontdesk agent. You are here to help guests with their problems.",
    "Your responses must be very short. All of your responses must be conversational as if speaking to someone.",
    "Check-in is available after 3 PM, and check-out is at 11 AM the next day."
])

Create the output folder for audio files:

# Output directory
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)

Create a helper function to play the AI speech:

def play_speech(prompt):
    audio_stream = generate(
        text=prompt,
        model="eleven_multilingual_v2",
        voice="Rachel",
        stream=True,
    )
    stream(audio_stream)

Create a function to generate LLM responses using Groq:

def llm_chat(user_input, chat_history, bot_name):

    # Build the message list from the system prompt, prior history, and the new user input
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *chat_history,
        {"role": "user", "content": user_input}
    ]

    # Create the chat completion
    chat_completion = groq_client.chat.completions.create(
        messages=messages,
        model="mixtral-8x7b-32768"
    )

    # Extract the LLM response
    response = chat_completion.choices[0].message.content
    print(f"{bot_name}: {response}")

    return response

Create a function to transcribe the user's speech using Whisper:

def transcribe_audio(audio_file):

    # Transcribe the audio
    result = whisper.transcribe(audio_file)

    # Extract the transcription
    texts = whisper.extract_text(result)

    # Drop empty segments and return a single lowercase string
    return " ".join([text.lower() for text in texts if text.strip()])

Create a function to record the user's speech:

def record_audio(file_path):

    # Initialize the PyAudio object
    p = pyaudio.PyAudio()

    # Set the audio parameters
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    CHUNK = 512
    RECORD_SECONDS = 5

    # Create the audio stream
    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK
    )

    # Empty list to store the audio frames
    frames = []

    print("Recording...")

    # Record the audio
    try:
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
    except KeyboardInterrupt:
        pass
    except Exception as e:
        print(f"Error while recording: {e}")
        raise e

    print("Recording complete.")

    # Close the stream
    stream.stop_stream()
    stream.close()
    p.terminate()

    # Write the recorded frames to a WAV file
    wf = wave.open(file_path, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

Create the main function to run the chat:

def converse():
    audio_file = "recording.wav"
    chat_history = []

    play_speech("Hello, welcome to SkyLounge Hotel. How can I help you today?")

    while True:

        # Record the user's audio
        record_audio(audio_file)

        # Transcribe the user's audio
        user_speech = transcribe_audio(audio_file)

        # Delete the temporary audio file
        os.remove(audio_file)

        # Exit the chat if the user says "exit"
        if user_speech.lower() == "exit":
            break

        # Add the user's speech to the chat history
        chat_history.append({"role": "user", "content": user_speech})
        print(f"You: {user_speech}")

        # Send the user's speech to the LLM
        bot_response = llm_chat(user_speech, chat_history, "Bot")

        # Append the LLM response to the chat history
        chat_history.append({"role": "assistant", "content": bot_response})

        # Play the LLM response using text-to-speech
        play_speech(bot_response)

        # Keep only the most recent messages in the chat history
        if len(chat_history) > 20:
            chat_history = chat_history[-20:]


if __name__ == "__main__":
    converse()

And that's it! You can now run the chat.py file to start the real-time voice chat with the AI.

python chat.py
You can find the full script here.
import os
import wave
from pydub import AudioSegment
from groq import Groq
from whispercpp import Whisper
from elevenlabs import generate, stream
import pyaudio


# Initialize the Whisper client
whisper = Whisper('tiny')


# Set the API keys
os.environ["ELEVEN_API_KEY"] = "YOUR API KEY"
os.environ["GROQ_API_KEY"] = "YOUR API KEY"


# Create API clients
groq_client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)


# Set the system prompt
SYSTEM_PROMPT = "\n".join([
    "You are a friendly hotel frontdesk agent. You are here to help guests with their problems.",
    "Your responses must be very short. All of your responses must be conversational as if speaking to someone.",
    "Check-in is available after 3 PM, and check-out is at 11 AM the next day."
])


# Output directory
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)


def play_speech(prompt):
    audio_stream = generate(
        text=prompt,
        model="eleven_multilingual_v2",
        voice="Rachel",
        stream=True,
    )
    stream(audio_stream)


def llm_chat(user_input, chat_history, bot_name):

    # Build the message list from the system prompt, prior history, and the new user input
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *chat_history,
        {"role": "user", "content": user_input}
    ]

    # Create the chat completion
    chat_completion = groq_client.chat.completions.create(
        messages=messages,
        model="mixtral-8x7b-32768"
    )

    # Extract the LLM response
    response = chat_completion.choices[0].message.content
    print(f"{bot_name}: {response}")

    return response


def transcribe_audio(audio_file):

    # Transcribe the audio
    result = whisper.transcribe(audio_file)

    # Extract the transcription
    texts = whisper.extract_text(result)

    # Drop empty segments and return a single lowercase string
    return " ".join([text.lower() for text in texts if text.strip()])


def record_audio(file_path):
    p = pyaudio.PyAudio()

    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    CHUNK = 512
    RECORD_SECONDS = 5

    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK
    )
    frames = []

    print("Recording...")

    try:
        for i in range(0, int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
    except KeyboardInterrupt:
        pass
    except Exception as e:
        print(f"Error while recording: {e}")
        raise e

    print("Recording complete.")

    # Close the stream
    stream.stop_stream()
    stream.close()
    p.terminate()

    # Write the recorded frames to a WAV file
    wf = wave.open(file_path, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()


def converse():
    audio_file = "recording.wav"
    chat_history = []

    play_speech("Hello, welcome to SkyLounge Hotel. How can I help you today?")

    while True:

        # Record the user's audio
        record_audio(audio_file)

        # Transcribe the user's audio
        user_speech = transcribe_audio(audio_file)

        # Delete the temporary audio file
        os.remove(audio_file)

        if user_speech.lower() == "exit":
            break

        # Add the user's speech to the chat history
        chat_history.append({"role": "user", "content": user_speech})
        print(f"You: {user_speech}")

        # Send the user's speech to the LLM
        bot_response = llm_chat(user_speech, chat_history, "Bot")

        # Append the LLM response to the chat history
        chat_history.append({"role": "assistant", "content": bot_response})

        # Play the LLM response using text-to-speech
        play_speech(bot_response)

        # Keep only the most recent messages in the chat history
        if len(chat_history) > 20:
            chat_history = chat_history[-20:]


if __name__ == "__main__":
    converse()

Demo Videos

Hotel frontdesk demo:

LLM chat demo as hotel staff

Bank scam demo:

LLM chat bank scam demo

Conclusions

It is trivial to build a real-time voice chat with AI using the latest hardware and software. The bad guys are already experimenting with this technology and using AI in their campaigns. But this technology is not going away; in fact, it will only become more prevalent. I'm betting on Apple to take the lead soon with a personal assistant AI that can chat in real time and live on your device. AI models are becoming more efficient, and mobile devices are being redesigned with more capable neural engines to power these models.

Also, check out these related projects if you want to run text-to-speech (TTS) and speech-to-text (STT) locally:

· 6 min read
Roy Firestein

Let's generate AI images in our likeness.

As a startup with remote team members, we don't get many chances to meet in person. This makes it difficult to take professional team photos for our website and marketing materials. I tried using Midjourney to generate AI images by giving it a few base images to copy the style from, but the results were not very good. Companies like Midjourney and OpenAI have put protections in place to prevent misuse of their technology, such as copyright infringement and deepfakes.

As a self-proclaimed hacker, I knew I could use open-source tools and public research to achieve my goals. I found a paper called InstantID: Identity Preserving Zero-Shot Image Generation, which describes a method to generate AI images in the likeness of a person. This method uses a diffusion-based model, similar to OpenAI's DALL-E, to generate images from a text prompt. Specifically, we'll use the SDXL model from Stability.ai.

Hardware Requirements

Before we begin, we'll need a powerful computer with a great GPU. The ideal specifications are:

  • 1x NVIDIA A100 GPU
  • 60+ GB of RAM
  • 200 GB of SSD storage

Don't worry, I don't have this hardware either. We can use Google Colab, which lets us run AI models on professional-grade hardware. Unfortunately, our system requirements exceed the free tier of Google Colab, so we'll need to pay for the Colab Pro subscription. The cost is $10 per month, which is a small price to pay for the power we'll get.

Getting Started

Log in to your Google Colab account and create a new notebook.

First, select the correct hardware profile in the notebook settings. Click on the "Runtime" menu, then "Change runtime type". Then select "A100 GPU" as the hardware accelerator. Also, make sure to toggle on the "High-RAM" option.
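Before going further, it's worth confirming the notebook actually picked up the GPU. Here is a quick check from a cell, assuming the A100 runtime was selected and PyTorch is available (it is by default on Colab):

# Confirm the GPU is attached and visible to PyTorch
!nvidia-smi

import torch
print(torch.cuda.is_available())       # should print True
print(torch.cuda.get_device_name(0))   # should mention the A100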

In the notebook, we'll need to upload some assets for the model to use. You can find a zip file with the assets here.

The assets include a face embedding model, the ControlNet, and the Image Prompt-adapter.

Extract the zip file and upload the files to your Google Colab notebook. The folder structure should look like this:

ip_adapter/
    attention_processor.py
    resampler.py
    utils.py
models/
    antelopev2/
        1k3d68.onnx
        2d106det.onnx
        genderage.onnx
        glintr100.onnx
        scrfd_10g_bnkps.onnx
pipeline_stable_diffusion_xl_instantid.py

Now that we have the assets and the correct runtime, we can start writing the code.

We begin by downloading the InstantID model from the Hugging Face Hub. The files will be saved to the checkpoints directory.

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="InstantX/InstantID", filename="ControlNetModel/config.json", local_dir="./checkpoints")
hf_hub_download(repo_id="InstantX/InstantID", filename="ControlNetModel/diffusion_pytorch_model.safetensors", local_dir="./checkpoints")
hf_hub_download(repo_id="InstantX/InstantID", filename="ip-adapter.bin", local_dir="./checkpoints")

Next, we'll install the required Python packages.

!pip install opencv-python transformers accelerate onnxruntime onnxruntime-gpu insightface diffusers pillow controlnet-aux

In the next cell, we load the necessary Python libraries and define a helper function for resizing images.

import cv2
import torch
import numpy as np
from PIL import Image
from diffusers.utils import load_image
from diffusers.models import ControlNetModel
from diffusers.pipelines.controlnet.multicontrolnet import MultiControlNetModel
from insightface.app import FaceAnalysis
from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline, draw_kps

# Helper function to resize the image
def resize_img(input_image, max_side=1280, min_side=1024, size=None,
               pad_to_max_side=False, mode=Image.BILINEAR, base_pixel_number=64):

    w, h = input_image.size
    if size is not None:
        w_resize_new, h_resize_new = size
    else:
        ratio = min_side / min(h, w)
        w, h = round(ratio*w), round(ratio*h)
        ratio = max_side / max(h, w)
        input_image = input_image.resize([round(ratio*w), round(ratio*h)], mode)
        w_resize_new = (round(ratio * w) // base_pixel_number) * base_pixel_number
        h_resize_new = (round(ratio * h) // base_pixel_number) * base_pixel_number
    input_image = input_image.resize([w_resize_new, h_resize_new], mode)

    if pad_to_max_side:
        res = np.ones([max_side, max_side, 3], dtype=np.uint8) * 255
        offset_x = (max_side - w_resize_new) // 2
        offset_y = (max_side - h_resize_new) // 2
        res[offset_y:offset_y+h_resize_new, offset_x:offset_x+w_resize_new] = np.array(input_image)
        input_image = Image.fromarray(res)
    return input_image

Next, we prepare the models and network.

# Load face encoder
app = FaceAnalysis(
    name='antelopev2',
    root='./',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)
app.prepare(ctx_id=0, det_size=(640, 640))

# Paths to InstantID models
face_adapter = f'./checkpoints/ip-adapter.bin'
controlnet_path = f'./checkpoints/ControlNetModel'
controlnet_depth_path = f'diffusers/controlnet-depth-sdxl-1.0-small'

# Load controlnet
controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)

# Base model from Stability.ai
base_model_path = 'stabilityai/stable-diffusion-xl-base-1.0'

# Create pipeline
pipe = StableDiffusionXLInstantIDPipeline.from_pretrained(
    base_model_path,
    controlnet=controlnet,
    torch_dtype=torch.float16,
)

# Move the pipeline to the GPU
pipe.cuda()

# Load the image-prompt adapter
pipe.load_ip_adapter_instantid(face_adapter)

Okay, we're ready to generate our first image. Let's load an image of ourselves and generate an AI image in our likeness.

# load an image
face_image = load_image("./examples/roy-1.jpg")

# resize the image
face_image = resize_img(face_image)

The model can't work with the raw bytes of an image file directly, so we need to convert the image into a representation it can understand. This process is called embedding, and it's a way to represent the image as a set of numbers. Models like GPT-4 and DALL-E use embeddings to understand and generate text and images.

# prepare face embeddings
face_info = app.get(cv2.cvtColor(np.array(face_image), cv2.COLOR_RGB2BGR))

# only use the largest detected face
face_info = sorted(face_info, key=lambda x: (x['bbox'][2] - x['bbox'][0]) * (x['bbox'][3] - x['bbox'][1]))[-1]
face_emb = face_info['embedding']
face_kps = draw_kps(face_image, face_info['kps'])

Now we create the prompt for the model to generate the image.

prompt = " ".join([
    "comic portrait of a man.",
    "graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
])

negative_prompt = " ".join([
    "photograph, deformed, glitch, noisy, realistic",
    "stock photo, black and white"
])

The negative prompt is used to guide the model away from generating images that are too realistic.

Finally, we generate the image.

# generate image
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image_embeds=face_emb,
    image=face_kps,
    controlnet_conditioning_scale=0.8,
    ip_adapter_scale=0.8,
    num_inference_steps=30,
    guidance_scale=5,
).images[0]

# save the result
image.save('result.jpg')

Here's the result:

Examples from the research paper:

InstantID

You can find a copy of this notebook here.

The resource utilization on the Google Colab Pro instance is not insignificant:

Conclusion

The technology can be used for a variety of purposes; some are exciting and some are scary. Tools are not inherently good or bad; it's how they are used that matters. As humans, we like to explore and push the boundaries of what's possible. Some of us will try to abuse the technology, while others will create countermeasures to detect or prevent abuse.

Here are some other use cases for the technology:

  • Synthetic photos of identification documents (driver's license, passport, etc.)
  • Fake profile pictures for social media
  • Images for marketing materials
  • Custom avatars for video games

· 11 min read
Roy Firestein

Crafting marketing copy demands considerable time and creativity, resources that early-stage startups often lack. Imagine streamlining this process through automation, enabling you to produce marketing copy swiftly with the help of AI tools. This article delves into the utilization of AI agents to expedite your marketing copy creation, offering support in your promotional endeavors.

· 4 min read
Roy Firestein

Ready to see the future? Let's dive into the art of predicting your sales numbers with a dash of Python wizardry. In this guide, we'll harness the power of time series forecasting to unveil the secrets of your future sales revenue. Armed with pandas for data mastery and statsmodels for crafting our crystal ball (aka forecasting model), we're setting you up to forecast like a pro.
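As a taste of that approach, here is a minimal sketch; the CSV file and column names are assumptions for illustration, not taken from the post. It fits a Holt-Winters model on monthly revenue and forecasts six months ahead.

import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical monthly revenue data; replace with your own export
df = pd.read_csv("sales.csv", parse_dates=["month"], index_col="month")
revenue = df["revenue"].asfreq("MS")

# Additive trend and yearly seasonality, then a six-month forecast
model = ExponentialSmoothing(revenue, trend="add", seasonal="add", seasonal_periods=12).fit()
print(model.forecast(6))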