
Tone of Voice in Sales

· 3 min read
Roy Firestein
CEO at Autohost.ai

As a Business Development Representative (BDR), your success hinges on your ability to connect with prospects and uncover their pain points. While your script and product knowledge are essential, there's one often-overlooked skill that can make or break your conversations: your tone.

Real-time Voice Chat with AI

· 9 min read
Roy Firestein
CEO at Autohost.ai

How hard is it to build an AI scammer or a frontdesk assistant? Not hard at all.

AI research is progressing at a breakneck pace thanks to the large investments in the field over the last decade and increasing computational power. The demand for AI has exceeded initial expectations, with businesses and individuals alike relying on AI to make their daily tasks more efficient. New companies are emerging to capture business opportunities in the AI space. One such company is Groq, which is developing a new AI inference accelerator. Groq promises the fastest AI inference at the lowest price per 1M tokens.

Now, let's talk about chatting with AI in real-time. It's not as simple as it sounds. Imagine having a conversation where every reply comes after an awkward pause—it wouldn't be fun, or believable, right? For AI to keep up in a real chit-chat, it needs to snap back with answers in less than a blink. That's under 500 milliseconds, to be exact. Groq's hardware, combined with the right AI models, makes this kind of speedy banter possible.

Think about all the cool stuff we can do with real-time voice chat AI. It's not just about asking your phone for the weather; it's about revolutionizing customer service, creating new ways to interact with technology hands-free, and offering a helping hand to those in need. But let's not sugarcoat it—there's a flip side. Just as we can use AI for good, some will try to use it for scams and other shady stuff. In this post, we're taking a deep dive into the world of real-time AI voice chats, showing you the good, the bad, and how to get your hands dirty building it.

You can skip to the Demo Videos section to see it all working together.

Requirements

This project will require the following:

  • An M2 MacBook with at least 16 GB of RAM
  • Free API key from Groq
  • Free API key from Eleven Labs
  • Python 3.10 and other software dependencies

Lab Setup

We begin with installing the required software.

Install Homebrew on your Mac if you haven't already.

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Using brew, install python@3.10, portaudio, and ffmpeg.

brew install python@3.10 portaudio ffmpeg
  • portaudio is required for the pyaudio package.
  • ffmpeg is required for the pydub package.

Now we can create the virtual environment and install the required Python packages. Virtual environments are used to isolate the dependencies of a project from the system's Python installation.

python3.10 -m venv venv

Activate the virtual environment.

source venv/bin/activate
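If you want to confirm the environment is really active before installing anything, a quick check from Python works. This is a minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical and not part of the script we build below.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("venv active:", in_virtualenv())
```

Run it with the same `python` you plan to use for the project; if it reports that no venv is active, activate the environment again.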

Great. Now we can install whisper.cpp, a C/C++ port of OpenAI's Whisper speech-to-text model that runs locally. Running transcription locally improves the response time of the AI. The port can also use Apple's CoreML framework to take advantage of the M2's Neural Engine.

Install the latest version from source:

pip install git+https://github.com/aarnphm/whispercpp.git -vv

Create a new file called chat.py and add the following code to import the required packages.

import os
import wave
from pydub import AudioSegment
from groq import Groq
from whispercpp import Whisper
from elevenlabs import generate, stream
import pyaudio

Define the required API keys:

# Set the API keys
os.environ["ELEVEN_API_KEY"] = "YOUR API KEY"
os.environ["GROQ_API_KEY"] = "YOUR API KEY"
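Hardcoding keys is fine for a quick experiment, but it is easy to commit them by accident. As an alternative, here is a minimal sketch that reads the keys from variables exported in your shell and fails loudly when one is missing. The helper name `require_env` is hypothetical.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise if it is unset."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Usage (assumes the keys were exported in the shell beforehand):
# groq_key = require_env("GROQ_API_KEY")
# eleven_key = require_env("ELEVEN_API_KEY")
```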

Download and initialize the Whisper model:

# Initialize the Whisper client
whisper = Whisper('tiny')

We are using Groq instead of OpenAI because it is faster. Create the Groq client:

# Create API clients
groq_client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)

Define the system prompt:

# Set the system prompt
SYSTEM_PROMPT = "\n".join([
    "You are a friendly hotel frontdesk agent. You are here to help guests with their problems.",
    "Your responses must be very short. All of your responses must be conversational as if speaking to someone.",
    "Check-in is available after 3 PM, and check out is at 11 the next day."
])

Create the output folder for audio files:

# Output directory
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)

Create a helper function to play the AI speech:

def play_speech(prompt):
    # Request the speech as a stream so playback can start early
    audio_stream = generate(
        text=prompt,
        model="eleven_multilingual_v2",
        voice="Rachel",
        stream=True,
    )
    stream(audio_stream)

Create a function to generate LLM responses using Groq:

def llm_chat(user_input, chat_history, bot_name):

    # Build the message list: system prompt, chat history, then the new user input
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *chat_history,
        {"role": "user", "content": user_input}
    ]

    # Create the chat completion
    chat_completion = groq_client.chat.completions.create(
        messages=messages,
        model="mixtral-8x7b-32768"
    )

    # Extract the LLM response
    response = chat_completion.choices[0].message.content
    print(f"{bot_name}: {response}")

    return response
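Waiting for the whole completion before speaking adds to the delay. One option is to consume the reply as a stream of text fragments and hand each finished sentence to text-to-speech as soon as it is complete. The sketch below covers only the sentence-splitting part in plain Python. `iter_sentences` is a hypothetical helper, and the fragments are assumed to come from whatever streaming interface your LLM client offers.

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?")


def iter_sentences(fragments: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from a stream of text fragments."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        # Emit every complete sentence currently sitting in the buffer.
        while True:
            positions = [buffer.find(mark) for mark in SENTENCE_ENDINGS]
            positions = [p for p in positions if p != -1]
            if not positions:
                break
            end = min(positions) + 1
            sentence = buffer[:end].strip()
            buffer = buffer[end:]
            if sentence:
                yield sentence
    # Flush whatever is left once the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

The split is deliberately naive: a decimal such as "3.5" would be cut in two, so treat it as a starting point and not a finished tokenizer.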

Create a function to transcribe the user's speech using Whisper:

def transcribe_audio(audio_file):

    # Transcribe the audio
    result = whisper.transcribe(audio_file)

    # Extract the transcription
    texts = whisper.extract_text(result)

    # Drop empty segments and return a single lowercase string
    return " ".join([text.lower() for text in texts if text.strip()])
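Speech-to-text output often comes back with punctuation and stray whitespace (for example "Exit."), which an exact comparison against "exit" would miss. Here is a minimal sketch of a normalizer for that case. Both helper names are hypothetical.

```python
import string


def normalize_transcript(text: str) -> str:
    """Lowercase the text, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def is_exit_command(text: str) -> bool:
    """True only when the whole utterance is the word 'exit'."""
    return normalize_transcript(text) == "exit"
```

A sentence that merely contains the word, such as "Where is the exit?", does not count as an exit command.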

Create a function to record the user's speech:

def record_audio(file_path):

    # Initialize the PyAudio object
    p = pyaudio.PyAudio()

    # Set the audio parameters
    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    CHUNK = 512
    RECORD_SECONDS = 5

    # Create the audio stream
    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK
    )

    # Empty list to store the audio frames
    frames = []

    print("Recording...")

    # Record the audio in fixed-size chunks for RECORD_SECONDS
    try:
        for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
    except KeyboardInterrupt:
        pass
    except Exception as e:
        print(f"Error while recording: {e}")
        raise

    print("Recording complete.")

    # Close the stream
    stream.stop_stream()
    stream.close()
    p.terminate()

    # Write the recorded frames to a WAV file
    wf = wave.open(file_path, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()
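The recorder always captures five seconds, even if you stop talking after one. A common refinement is to measure the loudness of each chunk and stop once the speaker goes quiet. The sketch below shows only the loudness test for 16-bit PCM chunks, using the standard library. The helper names and the threshold of 500 are assumptions that you would need to tune for your microphone.

```python
import array
import math


def rms(chunk: bytes) -> float:
    """Root-mean-square amplitude of 16-bit signed PCM audio."""
    samples = array.array("h")
    # Ignore a trailing odd byte so the buffer maps cleanly onto 16-bit samples.
    samples.frombytes(chunk[: len(chunk) - len(chunk) % 2])
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_silent(chunk: bytes, threshold: float = 500.0) -> bool:
    """True when the chunk is quieter than the (assumed) threshold."""
    return rms(chunk) < threshold
```

Inside the recording loop you could count consecutive silent chunks and break out after roughly a second of silence.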

Create the main function to run the chat:

def converse():
    audio_file = "recording.wav"
    chat_history = []

    play_speech("Hello, welcome to SkyLounge Hotel. How can I help you today?")

    while True:

        # Record the user's audio
        record_audio(audio_file)

        # Transcribe the user's audio
        user_speech = transcribe_audio(audio_file)

        # Delete the temp audio file
        os.remove(audio_file)

        # Exit the chat if the user says "exit"
        if user_speech.lower() == "exit":
            break

        print(f"You: {user_speech}")

        # Send the user's speech to the LLM. llm_chat adds the new user
        # message itself, so it must not be in chat_history yet.
        bot_response = llm_chat(user_speech, chat_history, "Bot")

        # Add both sides of the exchange to the chat history
        chat_history.append({"role": "user", "content": user_speech})
        chat_history.append({"role": "assistant", "content": bot_response})

        # Play the LLM response using text-to-speech
        play_speech(bot_response)

        # Remove old chats from the chat history
        if len(chat_history) > 20:
            chat_history = chat_history[-20:]


if __name__ == "__main__":
    converse()
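The loop caps the history at 20 messages so the prompt does not grow without bound. If you want the trimmed window to always open with a user turn, a small helper can handle that. This is a sketch, and `trim_history` is a hypothetical name.

```python
def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep only the most recent messages, starting on a user turn."""
    if max_messages <= 0:
        return []
    trimmed = history[-max_messages:]
    # Drop leading non-user messages so the window opens with a user turn.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

Calling `chat_history = trim_history(chat_history)` at the end of each loop iteration would replace the slice-based trimming.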

And that's it! You can now run the chat.py file to start the real-time voice chat with the AI.

python chat.py
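To see how close you are to the 500-millisecond target, it helps to time each stage separately. Here is a minimal sketch of a timing helper built on the standard library. The names `timed`, `timings`, and `report` are hypothetical and not part of chat.py.

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000


def report(budget_ms: float = 500.0) -> str:
    """Summarize the recorded stages against a latency budget."""
    total = sum(timings.values())
    parts = ", ".join(f"{name}={ms:.0f}ms" for name, ms in timings.items())
    status = "within" if total <= budget_ms else "over"
    return f"{parts} | total={total:.0f}ms ({status} {budget_ms:.0f}ms budget)"
```

In the chat loop you could wrap the transcription and LLM calls, for example `with timed("stt"): user_speech = transcribe_audio(audio_file)`, and print `report()` after each turn.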
You can find the full script here.
import os
import wave
from pydub import AudioSegment
from groq import Groq
from whispercpp import Whisper
from elevenlabs import generate, stream
import pyaudio


# Initialize the Whisper client
whisper = Whisper('tiny')


# Set the API keys
os.environ["ELEVEN_API_KEY"] = "YOUR API KEY"
os.environ["GROQ_API_KEY"] = "YOUR API KEY"


# Create API clients
groq_client = Groq(
    api_key=os.environ.get("GROQ_API_KEY"),
)


# Set the system prompt
SYSTEM_PROMPT = "\n".join([
    "You are a friendly hotel frontdesk agent. You are here to help guests with their problems.",
    "Your responses must be very short. All of your responses must be conversational as if speaking to someone.",
    "Check-in is available after 3 PM, and check out is at 11 the next day."
])


# Output directory
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)


def play_speech(prompt):
    # Request the speech as a stream so playback can start early
    audio_stream = generate(
        text=prompt,
        model="eleven_multilingual_v2",
        voice="Rachel",
        stream=True,
    )
    stream(audio_stream)


def llm_chat(user_input, chat_history, bot_name):

    # Build the message list: system prompt, chat history, then the new user input
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *chat_history,
        {"role": "user", "content": user_input}
    ]

    # Create the chat completion
    chat_completion = groq_client.chat.completions.create(
        messages=messages,
        model="mixtral-8x7b-32768"
    )

    # Extract the LLM response
    response = chat_completion.choices[0].message.content
    print(f"{bot_name}: {response}")

    return response


def transcribe_audio(audio_file):

    # Transcribe the audio
    result = whisper.transcribe(audio_file)

    # Extract the transcription
    texts = whisper.extract_text(result)

    # Drop empty segments and return a single lowercase string
    return " ".join([text.lower() for text in texts if text.strip()])


def record_audio(file_path):
    p = pyaudio.PyAudio()

    FORMAT = pyaudio.paInt16
    CHANNELS = 1
    RATE = 44100
    CHUNK = 512
    RECORD_SECONDS = 5

    stream = p.open(
        format=FORMAT,
        channels=CHANNELS,
        rate=RATE,
        input=True,
        frames_per_buffer=CHUNK
    )
    frames = []

    print("Recording...")

    # Record the audio in fixed-size chunks for RECORD_SECONDS
    try:
        for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
            data = stream.read(CHUNK)
            frames.append(data)
    except KeyboardInterrupt:
        pass
    except Exception as e:
        print(f"Error while recording: {e}")
        raise

    print("Recording complete.")

    # Close the stream
    stream.stop_stream()
    stream.close()
    p.terminate()

    # Write the recorded frames to a WAV file
    wf = wave.open(file_path, 'wb')
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(p.get_sample_size(FORMAT))
    wf.setframerate(RATE)
    wf.writeframes(b''.join(frames))
    wf.close()

def converse():
    audio_file = "recording.wav"
    chat_history = []

    play_speech("Hello, welcome to SkyLounge Hotel. How can I help you today?")

    while True:

        # Record the user's audio
        record_audio(audio_file)

        # Transcribe the user's audio
        user_speech = transcribe_audio(audio_file)

        # Delete the temp audio file
        os.remove(audio_file)

        # Exit the chat if the user says "exit"
        if user_speech.lower() == "exit":
            break

        print(f"You: {user_speech}")

        # Send the user's speech to the LLM. llm_chat adds the new user
        # message itself, so it must not be in chat_history yet.
        bot_response = llm_chat(user_speech, chat_history, "Bot")

        # Add both sides of the exchange to the chat history
        chat_history.append({"role": "user", "content": user_speech})
        chat_history.append({"role": "assistant", "content": bot_response})

        # Play the LLM response using text-to-speech
        play_speech(bot_response)

        # Remove old chats from the chat history
        if len(chat_history) > 20:
            chat_history = chat_history[-20:]


if __name__ == "__main__":
    converse()

Demo Videos

Hotel frontdesk demo:

LLM chat demo as hotel staff

Bank scam demo:

LLM chat bank scam demo

Conclusions

It is trivial to build a real-time voice chat with AI using the latest hardware and software. The bad guys are already experimenting and using AI in their campaigns. But this technology is not going away. In fact, it will only become more prevalent. I'm betting on Apple to take the lead soon with personal assistant AI that can chat in real-time, and live on your device. AI models are becoming more efficient, and mobile devices are being redesigned to include more neural engines to power these models.

Also, check out these related projects if you want to run text-to-speech (TTS) and speech-to-text (STT) locally:

Identity Preserving AI Image Generation

· 6 min read
Roy Firestein
CEO at Autohost.ai

Let's generate AI images in our likeness.

As a startup with remote team members, we don't get many chances to meet in person. This makes it difficult to take professional team photos for our website and marketing materials. I tried using Midjourney to generate AI images by giving it a few base images to copy the style from, but the results were not very good. Companies like Midjourney and OpenAI have put protections in place to prevent misuse of their technology, such as copyright infringement and deepfakes.

As a self-proclaimed hacker, I knew I could use open-source tools and public research to achieve my goals. I found a paper called InstantID: Identity Preserving Zero-Shot Image Generation which describes a method to generate AI images in the likeness of a person. This method uses a diffusion-based model, similar to OpenAI's DALL-E, to generate images from a text prompt. Specifically, we'll use the SDXL model from Stability.ai.

Hardware Requirements

Before we begin, we'll need a powerful computer with a great GPU. The ideal specifications are:

  • 1x NVIDIA A100 GPU
  • 60+ GB of RAM
  • 200 GB of SSD storage

Don't worry, I don't have this hardware either. We can use Google Colab, which lets us run AI models on professional-grade hardware. Unfortunately, our system requirements exceed the free tier of Google Colab, so we'll need to pay for the Colab Pro subscription. The cost is $10 per month, which is a small price to pay for the power we'll get.

Getting Started

Log in to your Google Colab account and create a new notebook.

First, select the correct hardware profile in the notebook settings. Click on the "Runtime" menu, then "Change runtime type". Then select "A100 GPU" as the hardware accelerator. Also, make sure to toggle on the "High-RAM" option.
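Once the runtime is connected, it is worth confirming which GPU you actually got. Here is a minimal sketch that shells out to `nvidia-smi` and parses its CSV output. The helper names are hypothetical, and the sample line format ("NAME, 40960 MiB") is an assumption about typical output.

```python
import shutil
import subprocess


def parse_gpu_line(line: str) -> tuple[str, int]:
    """Parse 'NAME, 40960 MiB' into (name, memory_in_MiB)."""
    name, memory = (part.strip() for part in line.rsplit(",", 1))
    return name, int(memory.split()[0])


def list_gpus() -> list[tuple[str, int]]:
    """Return the GPUs reported by nvidia-smi, or [] if none can be read."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            # Skip lines that do not match the assumed format.
            continue
    return gpus
```

Calling `list_gpus()` in a notebook cell should return one entry per GPU; an empty list means the runtime has no GPU attached or the tool is unavailable.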

In the notebook, we'll need to upload some assets for the model to use. You can find a zip file with the assets here.

The assets include a face embedding model, the ControlNet, and the Image Prompt-adapter.

Extract the zip file and upload the files to your Google Colab notebook. The folder structure should look like this:

ip_adapter/
    attention_processor.py
    resampler.py
    utils.py
models/
    antelopev2/
        1k3d68.onnx
        2d106det.onnx
        genderage.onnx
        glintr100.onnx
        scrfd_10g_bnkps.onnx
pipeline_stable_diffusion_xl_instantid.py
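A missing upload tends to surface later as a confusing import or model-loading error, so a quick check up front can save time. The sketch below looks for a few of the paths from the structure above. `missing_assets` and `EXPECTED_ASSETS` are hypothetical names, and the list covers only the Python files and the model folder.

```python
from pathlib import Path

# A subset of the paths from the folder structure above.
EXPECTED_ASSETS = [
    "ip_adapter/attention_processor.py",
    "ip_adapter/resampler.py",
    "ip_adapter/utils.py",
    "models/antelopev2",
    "pipeline_stable_diffusion_xl_instantid.py",
]


def missing_assets(root: str = ".", expected=EXPECTED_ASSETS) -> list[str]:
    """Return the expected paths that do not exist under root."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).exists()]
```

An empty list from `missing_assets()` means every listed path was found relative to the notebook's working directory.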

Now that we have the assets and the correct runtime, we can start writing the code.

We begin by downloading the InstantID model from the Hugging Face Hub. The files will be saved to the checkpoints directory.

from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="InstantX/InstantID", filename="ControlNetModel/config.json", local_dir="./checkpoints")
hf_hub_download(repo_id="InstantX/InstantID", filename="ControlNetModel/diffusion_pytorch_model.safetensors", local_dir="./checkpoints")
hf_hub_download(repo_id="InstantX/InstantID", filename="ip-adapter.bin", local_dir="./checkpoints")

Next, we'll install the required Python packages.

!pip install opencv-python transformers accelerate onnxruntime onnxruntime-gpu insightface diffusers pillow controlnet-aux

In the next cell, we will load the necessary Python libraries and helper functions.

import cv2
import torch
import numpy as np
from PIL import Image
from diffusers.utils import load_image
from diffusers.models import ControlNetModel
from diffusers.pipelines.controlnet.multicontrolnet import MultiControlNetModel
from insightface.app import FaceAnalysis
from pipeline_stable_diffusion_xl_instantid import StableDiffusionXLInstantIDPipeline, draw_kps

# Helper function to resize the image
def resize_img(input_image, max_side=1280, min_side=1024, size=None,
               pad_to_max_side=False, mode=Image.BILINEAR, base_pixel_number=64):

    w, h = input_image.size
    if size is not None:
        # Use the explicit target size as given
        w_resize_new, h_resize_new = size
    else:
        # Scale the image, then round each side down to a multiple of base_pixel_number
        ratio = min_side / min(h, w)
        w, h = round(ratio*w), round(ratio*h)
        ratio = max_side / max(h, w)
        input_image = input_image.resize([round(ratio*w), round(ratio*h)], mode)
        w_resize_new = (round(ratio * w) // base_pixel_number) * base_pixel_number
        h_resize_new = (round(ratio * h) // base_pixel_number) * base_pixel_number
    input_image = input_image.resize([w_resize_new, h_resize_new], mode)

    if pad_to_max_side:
        # Center the resized image on a white square canvas
        res = np.ones([max_side, max_side, 3], dtype=np.uint8) * 255
        offset_x = (max_side - w_resize_new) // 2
        offset_y = (max_side - h_resize_new) // 2
        res[offset_y:offset_y+h_resize_new, offset_x:offset_x+w_resize_new] = np.array(input_image)
        input_image = Image.fromarray(res)
    return input_image
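The size arithmetic in `resize_img` is easier to follow with concrete numbers. The sketch below reproduces only that arithmetic, for the default case where no explicit `size` is passed, so it runs without loading an image. `target_size` is a hypothetical helper.

```python
def target_size(w: int, h: int, max_side: int = 1280, min_side: int = 1024,
                base_pixel_number: int = 64) -> tuple[int, int]:
    """Reproduce the size arithmetic of resize_img without touching pixels."""
    # Step 1: scale so the shorter side reaches min_side.
    ratio = min_side / min(h, w)
    w, h = round(ratio * w), round(ratio * h)
    # Step 2: scale again so the longer side equals max_side.
    ratio = max_side / max(h, w)
    # Step 3: round each side down to a multiple of base_pixel_number.
    w_new = (round(ratio * w) // base_pixel_number) * base_pixel_number
    h_new = (round(ratio * h) // base_pixel_number) * base_pixel_number
    return w_new, h_new


print(target_size(1000, 1500))  # (832, 1280)
```

A 1000x1500 portrait ends up at 832x1280: the long side lands on `max_side`, and the short side is rounded down from 853 to the nearest multiple of 64.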

Next, we prepare the models and network.

# Load face encoder
app = FaceAnalysis(
    name='antelopev2',
    root='./',
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
)
app.prepare(ctx_id=0, det_size=(640, 640))

# Path to InstantID models
face_adapter = './checkpoints/ip-adapter.bin'
controlnet_path = './checkpoints/ControlNetModel'
controlnet_depth_path = 'diffusers/controlnet-depth-sdxl-1.0-small'

# Load controlnet
controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)

# Base model from Stability.ai
base_model_path = 'stabilityai/stable-diffusion-xl-base-1.0'

# Create pipeline
pipe = StableDiffusionXLInstantIDPipeline.from_pretrained(
    base_model_path,
    controlnet=controlnet,
    torch_dtype=torch.float16,
)

# Move the pipeline to the GPU
pipe.cuda()

# load the image-prompt adapter
pipe.load_ip_adapter_instantid(face_adapter)

Okay, we're ready to generate our first image. Let's load an image of ourselves and generate an AI image in our likeness.

# load an image
face_image = load_image("./examples/roy-1.jpg")

# resize the image
face_image = resize_img(face_image)

The model can't work with the raw bytes of an image file directly, so we need to convert the image to something that the model can understand. This process is called embedding, and it's a way to represent the image as a set of numbers. Models like GPT-4 and DALL-E use embeddings to understand and generate text and images.

# prepare face embeddings
face_info = app.get(cv2.cvtColor(np.array(face_image), cv2.COLOR_RGB2BGR))

# use only the largest face (bounding-box width times height)
face_info = sorted(face_info, key=lambda x: (x['bbox'][2]-x['bbox'][0])*(x['bbox'][3]-x['bbox'][1]))[-1]
face_emb = face_info['embedding']
face_kps = draw_kps(face_image, face_info['kps'])
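Since an embedding is just a list of numbers, two embeddings can be compared with ordinary vector math. A common measure is cosine similarity, where values near 1 mean the vectors point in the same direction. Here is a minimal sketch on toy vectors in pure Python. It illustrates the idea and is not a step of the InstantID pipeline.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

With real face embeddings, you could pass in the `embedding` arrays from two photos to get a rough sense of how alike the faces are.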

Now we create the prompt for the model to generate the image.

prompt = " ".join([
    "comic portrait of a man.",
    "graphic illustration, comic art, graphic novel art, vibrant, highly detailed",
])

negative_prompt = " ".join([
    "photograph, deformed, glitch, noisy, realistic,",
    "stock photo, black and white"
])

The negative prompt is used to guide the model away from generating images that are too realistic.

Finally, we generate the image.

# generate image
image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image_embeds=face_emb,
    image=face_kps,
    controlnet_conditioning_scale=0.8,
    ip_adapter_scale=0.8,
    num_inference_steps=30,
    guidance_scale=5,
).images[0]

# save the result
image.save('result.jpg')
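The scale parameters in the call above affect the result, and the values that look best will likely differ from photo to photo. One way to explore them is to build a small grid of settings and generate one image per combination. The sketch below builds only the grid. `sweep` is a hypothetical helper, and the candidate values are arbitrary assumptions.

```python
from itertools import product


def sweep(conditioning_scales, adapter_scales, guidance_scales):
    """Build one keyword-argument dict per combination of settings."""
    return [
        {
            "controlnet_conditioning_scale": c,
            "ip_adapter_scale": a,
            "guidance_scale": g,
        }
        for c, a, g in product(conditioning_scales, adapter_scales, guidance_scales)
    ]


settings = sweep([0.6, 0.8], [0.6, 0.8], [5])
print(len(settings))  # 4
```

Each dict could then be unpacked into the pipeline call alongside the other arguments, saving every result under a different file name for side-by-side comparison.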

Here's the result:

Examples from the research paper:

InstantID

You can find a copy of this notebook here.

The resource utilization on the Google Colab Pro instance is not insignificant:

Conclusion

The technology can be used for a variety of purposes; some are exciting and some are scary. Tools are not inherently good or bad; it's how they are used that matters. As humans, we like to explore and push the boundaries of what's possible. Some of us will try to abuse the technology, while others will create counter-measures to protect against or detect abuse.

Here are some other use cases for the technology:

  • Synthetic photos of identification documents (driver's license, passport, etc.)
  • Fake profile pictures for social media
  • Images for marketing materials
  • Custom avatars for video games

The Art of Outbound: Crafting Killer Calls and Emails That Convert

· 8 min read
Roy Firestein
CEO at Autohost.ai

Hey there, sales superstar! Ready to take your outbound game to the next level? In this post, we'll dive deep into the nitty-gritty of structuring outbound calls and emails that grab prospects' attention and get them eager to learn more.

First things first: let's talk about the foundation of any good outbound strategy: personalization. Gone are the days of generic, one-size-fits-all messaging. If you want to stand out in a crowded inbox or cut through the noise on a cold call, you need to do your homework and tailor your approach to each prospect's unique needs and pain points.

But how do you do that without spending hours stalking them on LinkedIn? Fear not, my friend. I've got some tried-and-true techniques to help you craft outbound calls and emails that are personalized, relevant, and most importantly, effective.