chunk-match

NodeJS library that semantically chunks text and matches it against a user query using cosine similarity for precise and relevant text retrieval

Working with large language models (LLMs) has become increasingly common. People use LLMs and large multimodal models (LMMs) for tasks ranging from simple daily activities to large-scale company operations. Whether for retrieving data or seeking assistance, LLMs have become an integral part of our routines, and proficiency in working with them is now a highly sought-after skill in many industries.

Industry Trust in Large Language Models

Established companies and responsible individuals cannot fully endorse everything an LLM generates. We still cannot trust LLM outputs completely in serious matters: there are often no verifiable sources, and LLMs are trained primarily on openly available data. Imagine a 12-year-old girl who speaks multiple languages fluently, listens to everyone's conversations, and relays everything to her mom when she gets home, without questioning it or reasoning about it. Similarly, LLMs process and generate information without critical analysis. The emergence of unofficial "shadow AI" systems within organizations poses further risks.

Still, restricting the use of AI is not a good option in a competitive environment. We cannot easily control every ethical problem, but we can put up a barrier so that the model uses only trusted data.

LLM Agent is a Young Prodigy with a Gift for Communication

As mentioned earlier, if we think of an LLM as a young girl or boy who speaks and understands requests well, we can give them access to a wealth of resources such as books and official websites. We can then ask for the information we need, and they can provide it, not from their own memory or experiences, but from the trusted books and other official sources we've provided.
This approach saves us time in learning and verifying data, and the output is more likely to be accurate because it is derived from the reliable sources we have supplied.

[Figure: GIF of the RAG architecture]

Retrieval Augmented Generation (RAG)

This approach mirrors the Retrieval-Augmented Generation (RAG) architecture in artificial intelligence. In RAG, the model doesn't rely solely on the pre-trained data stored within its parameters. Instead, relevant passages are retrieved from an external data source at query time and handed to the model. This helps keep the information up to date and accurate, because it is drawn directly from reliable and authoritative sources. RAG combines the strengths of both retrieval and generation: the retrieval component fetches precise information from trusted sources, while the generation component uses that information to produce coherent and contextually appropriate responses.

I have created a small text data set covering four different sports, collected from Wikipedia. Because the data comes from a trusted source, we will use it as our data source.

    sen_1 = "Cricket is played between two teams of eleven on a field with a 22-yard pitch and a wicket at each end."
    sen_2 = "Two players from the batting team stand at either wicket with bats, while the bowler from the fielding team bowls the ball towards the striker's wicket."
    sen_3 = "The striker tries to hit the ball and switch places with the nonstriker to score runs; runs are also scored when the ball crosses the boundary or is bowled illegally."
    sen_4 = "The fielding team tries to dismiss batters by hitting the wicket with the ball, catching the ball, or preventing the batters from crossing the crease."
    sen_5 = "Football refers to a family of team sports involving kicking a ball to score a goal, with the term usually denoting the most popular form in a particular region."
    sen_6 = "Common forms of football include soccer, Australian rules football, Gaelic football, American football, Canadian football, rugby league, and rugby union."
    sen_7 = "Modern football codes evolved from traditional games played worldwide, with contemporary rules being codified in English public schools in the 19th century."
    sen_8 = "The spread of football rules was facilitated by the British Empire, and by the late 19th century, distinct regional codes like Gaelic football had developed."
    sen_9 = "Formula One (F1) is the highest class of international racing for open-wheel single-seater cars, governed by the Fédération Internationale de l'Automobile (FIA)."
    sen_10 = "The FIA Formula One World Championship, started in 1950, is a series of races called Grands Prix, held on circuits or closed public roads across various countries."
    sen_11 = "Points from Grands Prix determine annual World Championships for drivers and constructors, with drivers requiring a Super Licence and races occurring on grade one tracks."
    sen_12 = "F1 cars, known for their high speeds due to aerodynamic downforce, have undergone changes to reduce turbulence and improve overtaking, relying heavily on electronics, aerodynamics, suspension, and tyres."
    sen_13 = "Golf is a precision sport in which players use various clubs to hit balls into a series of holes on a course in as few strokes as possible."
    sen_14 = "The major professional golf tournaments, known as the majors, include the Masters, the U.S. Open, The Open Championship, and the PGA Championship, held annually across different prestigious courses."
    sen_15 = "Performance in these tournaments contributes to rankings and titles such as the World Golf Championships, with professional golfers needing a strong short and long game, as well as mental resilience."
    sen_16 = "Golf clubs, ranging from drivers to putters, are designed for different distances and shot types, with modern technology focusing on materials, aerodynamics, and swing analysis to enhance performance."

    # Collect the sentences into the list that the rest of the code iterates over.
    all_input_text = [sen_1, sen_2, sen_3, sen_4, sen_5, sen_6, sen_7, sen_8,
                      sen_9, sen_10, sen_11, sen_12, sen_13, sen_14, sen_15, sen_16]

Next, we turn the text in our data source into numerical vectors (embeddings). For that we use a free, open-source model from Hugging Face, which maps each text to a dense vector (1024 dimensions for this model).

    from FlagEmbedding import FlagModel

    embedding_model = FlagModel('BAAI/bge-large-zh-v1.5')

We now iterate through the data source and use the embedding model to convert each text into its numerical representation. The embeddings are collected in a list, which is then converted into a NumPy array for efficient numerical operations. This step prepares the text for the similarity calculations that follow.

    import numpy as np

    embeddings = []
    for input_text in all_input_text:
        emb = embedding_model.encode(input_text)
        embeddings.append(emb)

    embeddings_array = np.array(embeddings)
    print("Shape: " + str(embeddings_array.shape), "\n")
    print("Array:", embeddings_array[0])

Next, we perform dimensionality reduction with scikit-learn's PCA model to simplify our high-dimensional text embeddings. Reducing the dimensionality from 1024 to 3 makes the data easy to plot while preserving much of its important semantic structure.

    from sklearn.decomposition import PCA

    PCA_model = PCA(n_components=3)
    PCA_model.fit(embeddings_array)
    new_values = PCA_model.transform(embeddings_array)
    print("Shape: " + str(new_values.shape))
    print(new_values)

A Graphical Insight into Our Data Source Embeddings

[Figure: Graphical visualization of how the embeddings are spread]

The graph shows the embeddings clustered into three distinct areas.
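To look at the projection step in isolation, without downloading the embedding model, here is a minimal sketch of what a three-component PCA computes. It deliberately swaps scikit-learn's PCA class for a hand-written version (centering followed by a singular value decomposition in NumPy) so the mechanics are visible. The synthetic clusters, the random seed, and the names fake_embeddings, projected, and explained_ratio are assumptions for illustration only; they stand in for the real 16 x 1024 embedding matrix and are not the article's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real embeddings:
# 4 "topics" x 4 sentences each, in a 1024-dimensional space.
centers = rng.normal(size=(4, 1024))
fake_embeddings = np.vstack(
    [center + 0.1 * rng.normal(size=(4, 1024)) for center in centers]
)

# PCA by hand: center the data, then find the principal directions via SVD.
centered = fake_embeddings - fake_embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# Project onto the first three principal directions.
n_components = 3
projected = centered @ Vt[:n_components].T

# Fraction of the total variance retained by each kept component.
explained_ratio = (S[:n_components] ** 2) / np.sum(S ** 2)

print("Shape:", projected.shape)
print("Fraction of variance kept:", explained_ratio.sum())
```

Checking the retained variance is one way to judge how much structure a 2D or 3D plot can show; the article's own code does not report this figure.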
Our data source, however, covers four different sports, not three. This discrepancy offers valuable insight into how embeddings work. To get a clearer picture, let's examine the data from a 3D perspective.

[Figure: 3D visualization of how the embeddings are spread]

The 3D scatter plot gives a clearer view of how the embeddings for the four sports are distributed. Each sport is represented by a distinct color intensity, and points of similar intensity cluster together, indicating data about the same sport. The plot reveals that the embeddings are spread across four separate areas. Some color intensities overlap between adjacent areas, however, which may reflect sentences describing different aspects of the same sport. For instance, the purple and light blue areas correspond to cricket and football, respectively. Their proximity in the plot suggests that the football and cricket sentences share more features with each other than with the other sports. To check this observation, we can use cosine similarity to quantify how similar the embeddings are.

Leveraging Cosine Similarity for Accurate Data Comparisons

    from sklearn.metrics.pairwise import cosine_similarity

    def compute_cosine_similarity(embeddings: np.ndarray, idx1: int, idx2: int) -> float:
        """
        Computes the cosine similarity between two embeddings.

        Parameters:
            embeddings (np.ndarray): An array of embeddings.
            idx1 (int): The index of the first embedding.
            idx2 (int): The index of the second embedding.

        Returns:
            float: The cosine similarity between the two embeddings.
        """
        return cosine_similarity([embeddings[idx1]], [embeddings[idx2]])[0][0]

    print("cricket_[0] vs cricket_[1] :", compute_cosine_similarity(embeddings, 0, 1))
    print("cricket_[0] vs F1_[9]      :", compute_cosine_similarity(embeddings, 0, 9))
    print("football_[5] vs cricket_[1]:", compute_cosine_similarity(embeddings, 5, 1))
    print("golf_[15] vs F1_[10]       :", compute_cosine_similarity(embeddings, 15, 10))

Visualizing the Relationships Between Sport Embeddings

    cricket_[0] vs cricket_[1] : 0.7271751
    cricket_[0] vs F1_[9]      : 0.4558613
    football_[5] vs cricket_[1]: 0.54893225
    golf_[15] vs F1_[10]       : 0.5249833

So let's compare the cosine similarities between each pair of sports.

[Table: Comparison of cosine similarities between each sport]

Let's Dive with Vector Search

We can now run a vector search over the prepared list of embeddings.

    query = "What is cricket?"
    q_ = embedding_model.encode(query)

The query "What is cricket?" is converted into a numerical vector, or embedding, using the embedding_model.encode() function. This translates the textual query into the same vector space as the data source, capturing the meaning and context of the query. The resulting vector (q_) can then be compared against the other texts to assess their similarity.

    from typing import List

    def search(embeddings: List, q_: List) -> List[float]:
        """
        Search for the cosine similarity scores between a query vector (q_)
        and a list of embedding vectors.

        Parameters:
            embeddings (List[List[float]]): A list of embedding vectors.
            q_ (List[float]): The query vector for which the cosine similarity
                scores are calculated.

        Returns:
            List[float]: A list of cosine similarity scores between the query
                vector and each embedding vector.
        """
        scores = []
        for vec in embeddings:
            scores.append(cosine_similarity([vec], [q_])[0][0])
        return scores

    score_list = search(embeddings, q_)

The search function computes cosine similarity scores between a given query vector (q_) and a list of embedding vectors.
The function iterates through each embedding in the list, calculates its cosine similarity with the query vector, and returns the list of scores. Each score reflects how closely the corresponding embedding aligns with the query.

    n = 3
    sorted_indices = np.argsort(score_list)[::-1]  # highest score first
    top_n_indices = sorted_indices[:n]
    top_n_indices  # displays the selected indices when run in a notebook

    retrieved_content = []
    for i in top_n_indices:
        print(all_input_text[i], "\n")
        retrieved_content.append(all_input_text[i])

This code identifies the top n most similar items by score. It sorts the scores in descending order, takes the indices of the top n scores, and uses them to retrieve the matching content from all_input_text. The content is both printed and stored in retrieved_content. For the given query, the top 3 matches are:

    Cricket is played between two teams of eleven on a field with a 22-yard pitch and a wicket at each end.

    Two players from the batting team stand at either wicket with bats, while the bowler from the fielding team bowls the ball towards the striker's wicket.

    The spread of football rules was facilitated by the British Empire, and by the late 19th century, distinct regional codes like Gaelic football had developed.

Passing the Remaining Tasks to the LLM Agent

Having retrieved the passages most relevant to our query, the next step is to produce the final answer with a large language model (LLM). We provide the retrieved passages to the LLM and instruct it to use them as its only source. The LLM then generates a final response based solely on this data. In this case, I used Google's generative AI as the LLM agent, passing in the question and requesting a response within 60 tokens.
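The retrieval steps described so far (scoring, ranking, and handing passages to the LLM) can be sketched end to end on toy data. Cosine similarity is the dot product of two vectors divided by the product of their lengths, so the scoring loop can be written as a single matrix product. The article does not show its prompt or its client code, so everything below is an assumption made for illustration: the function names cosine_scores, top_n_passages, and build_prompt, the wording of the prompt template, and the three-dimensional toy vectors. No LLM is actually called.

```python
from typing import List

import numpy as np


def cosine_scores(embeddings: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Cosine similarity between `query` and every row of `embeddings`.

    Assumes no vector is all zeros (that would divide by zero).
    """
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    return (embeddings @ query) / norms


def top_n_passages(scores: np.ndarray, passages: List[str], n: int = 3) -> List[str]:
    """Return the n passages with the highest scores, best first."""
    best_first = np.argsort(scores)[::-1][:n]
    return [passages[i] for i in best_first]


def build_prompt(question: str, context: List[str]) -> str:
    """Assemble a prompt that restricts the model to the supplied context."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy passages and made-up 3-dimensional "embeddings" (not real model output).
passages = [
    "Cricket is played between two teams of eleven.",
    "Golf is a precision sport played with clubs.",
    "The bowler bowls the ball towards the striker's wicket.",
]
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],  # same direction as the query
    [0.0, 0.0, 2.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # 45 degrees away from the query
])
toy_query = np.array([2.0, 0.0, 0.0])

scores = cosine_scores(toy_embeddings, toy_query)
context = top_n_passages(scores, passages, n=2)
prompt = build_prompt("What is cricket?", context)

print(scores)
print(prompt)
```

Because cosine similarity depends only on direction, the differing vector lengths in the toy data do not affect the ranking. The resulting prompt string would then be sent to whichever LLM client is in use.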
Here is the response I received from the LLM agent using the given data source:

Question: What is cricket?
Answer: Cricket is a sport played between two teams of eleven players on a field with a 22-yard pitch and a wicket at each end. Two players from the batting team stand at either wicket with bats, while the bowler from the fielding team bowls the ball towards the striker's wicket.

Let's try a few more questions.

Question: When is the formular championship started?
Answer: The FIA Formula One World Championship started in 1950.

Question: In which sport, kick a ball to get a goal?
Answer: Football

Question: Who is the fastest F1 Driver?
Answer: The provided text does not specify who the fastest F1 Driver is, so I cannot answer this question from the provided context.

In the final question, I asked about something the provided data source does not cover. As the response shows, the LLM acknowledges that it has no answer. This highlights a crucial aspect of working with individual LLM agents: although they are designed to generate a response to any query, their knowledge can be significantly limited. In large-scale decision-making, relying on such agents without accounting for those limitations could lead to erroneous outcomes.

Let's talk.

Based on this demonstration, we can address a significant issue that many industries face with LLM agents. By using the Retrieval-Augmented Generation (RAG) architecture, we improve the reliability and trustworthiness of the LLM's outputs, which helps mitigate many of the trust issues users might encounter. RAG may not eliminate every problem, but it lets us keep control over what the LLM's responses are based on, and the method can be refined progressively to suit specific use cases. Overall, the RAG architecture is a fundamental mechanism for improving the accuracy and reliability of LLM-generated content.