Transformers from Scratch

Questions and Answers

In the beginning were the words. So very many words. Our first step is to convert all the words to numbers so we can do math on them.
Why can't we do math on words?
Here's why:
  1. Computational Nature of Machine Learning: Machine learning algorithms learn from data by calculating weights, biases, gradients, and making predictions. All these calculations are mathematical operations, which require numbers.
  2. Vector Spaces and Semantic Similarity: Words converted to numerical forms, like word embeddings, can be thought of as points in a vector space. The "distance" and "angle" between these points help the model understand the semantic relationship between words. This would not be possible with words in their raw, textual form.
  3. Consistency and Scalability: Assigning each word a unique number (or vector) allows for consistent interpretation by the machine learning model. It's a scalable solution because it doesn't matter if there are hundreds or millions of unique words, they can all be converted to numerical form.
  4. Data Encoding and Feature Extraction: Numerical representations enable feature extraction from text, like counting the frequency of each word, measuring the length of sentences, etc. These features help the model understand patterns in the text.
We start by choosing our vocabulary, the collection of symbols that we are going to be working with in each sequence.
What is Vocabulary?
In the context of Natural Language Processing (NLP), a "vocabulary" is the specific set of unique words that a machine learning model has been trained on and can recognize.
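As a concrete sketch, here is one minimal (hypothetical) way to build a vocabulary in Python by assigning each unique word an integer id:

```python
# A toy vocabulary: map each unique word to an integer id,
# in order of first appearance. Text is made up for illustration.
text = "the cat sat on the mat"
words = text.split()

vocab = {}
for word in words:
    if word not in vocab:
        vocab[word] = len(vocab)

print(vocab)        # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}

# The whole sentence becomes a list of numbers we can do math on.
token_ids = [vocab[w] for w in words]
print(token_ids)    # [0, 1, 2, 3, 0, 4]
```

Real tokenizers are more elaborate (subwords, unknown-word handling), but the core idea is the same: a fixed mapping from symbols to numbers.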
Why is it called a higher-dimensional space when it is actually lower-dimensional?
I understand how that might be confusing! When we say "high-dimensional" in machine learning or data science, it's a relative term that generally means "more than two or three." That's because we live in a three-dimensional world and are used to visualizing data in two dimensions (like on a piece of paper or a computer screen), so anything more than that is "high" for us.
However, when we say "lower-dimensional," it doesn't mean less than two or three. Instead, it usually refers to a space with fewer dimensions than the original data. For instance, if we start with data in a 1000-dimensional space, a 300-dimensional space would be "lower-dimensional" compared to the original, even though 300 is still quite high by everyday standards.
It's all about the context: "high-dimensional" means "having many dimensions compared to what we're used to," while "lower-dimensional" means "having fewer dimensions compared to the original data."
For example, think about a group of people from different countries attending a music festival. If we describe each person by their age, height, weight, country of origin, and favorite music genre, we have a 5-dimensional space. But if we decide that for some analysis we only care about age and music preference, we'd be working in a 2-dimensional space – that's "lower-dimensional" compared to the original 5 dimensions, even though age and music preference are still two distinct dimensions.
Does this clarify the concepts of "high-dimensional" and "lower-dimensional" for you?
Isn't it still a 2D space? Whether we have 3 words or 10k words, aren't we representing them all in a two-dimensional array or matrix?
Imagine you're playing a video game with a character that can move in a virtual world. This world has three dimensions: forward/backward, left/right, and up/down.
Now, on your screen, you only see a 2D representation of this world, right? The screen itself doesn't have depth. But you, as the player, understand that your character moves in a 3D environment. If your character goes "into" the screen, you understand it as moving forward, even though, physically, it's all displayed on the same flat screen.
You can think of the term "embedding" in this context as a way to include additional information within something else. It's similar to how a letter can be embedded within a word. For example, in the word "Apple", the letter "A" is embedded within it. Even though you don't see "A" standing alone, you know it's part of the word, giving it meaning.
Similarly, when we talk about dimensions being "embedded" in a vector, it's a way of saying that the vector contains within it a certain amount of dimensional information. This isn't something you can necessarily "see" just by looking at the vector, but it's inherently part of how the vector is structured and interpreted.
If we consider a vector for a word, like "Apple": [1, 0, 0, 0], each position in that vector is a different "dimension". The entire vector exists in a space that has as many dimensions as there are positions in the vector. Even though we write it down in a one-dimensional row or column (which is a 2D table when we include multiple words), the vector is representing a point in a four-dimensional space because it has four positions.
So even though we write it in a "flattened" form, the dimensional information is "embedded" or included within the structure of the vector. When we work with this vector in computations, we treat it as a point in a four-dimensional space, not just a simple list of numbers.
This is similar to what we mean by dimensions being "embedded" in the vectors. Even though we write these vectors in a 2D structure (like rows and columns in a matrix or a table), conceptually, each vector represents a point in a higher-dimensional space. The dimensionality comes from the length of the vectors, not from the rows and columns in the table.
Going back to the fruits example:
  • When we say "Apple" is [1, 0, 0, 0], we can also say "Apple" exists at the point (1, 0, 0, 0) in a four-dimensional space.
We 'embed' the concept of dimensionality in our understanding of the data, rather than it being explicitly present in the physical or visual structure of the data.
Does this help, or is there another part of the concept you'd like me to explain further?
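To make this concrete, here is a small sketch using a hypothetical four-fruit vocabulary. It shows that each one-hot row is a point in four-dimensional space, even though we store all the rows together in a flat 2D array:

```python
import numpy as np

# Hypothetical vocabulary of size 4; one one-hot vector per fruit.
fruits = ["Apple", "Banana", "Cherry", "Date"]
one_hot = np.eye(len(fruits))   # 4x4 identity matrix: one row per fruit

apple = one_hot[0]
print(apple)          # [1. 0. 0. 0.]
print(apple.shape)    # (4,) -> one point in a 4-dimensional space

# Stacking the rows gives a 2D table (a matrix), but each row still
# encodes a 4-dimensional point; the dimensionality comes from the
# length of each vector, not from the table having rows and columns.
print(one_hot.shape)  # (4, 4)
```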
The first thing that becomes apparent is that, when trying to predict the word that comes after ran, we no longer look at just one line, but rather a whole set of them. We've moved out of the Markov realm now. Each row no longer represents the state of the sequence at a particular point. Instead, each row represents one of many features that may describe the sequence at a particular point.
What is Markov Realm
In this context, the "Markov realm" is an informal name for the regime where the Markov property holds. A Markov chain is a stochastic model that describes a sequence of events in which the probability of each event depends only on the previous event (or a fixed number of previous events). In other words, the future state of a Markov chain is determined solely by its current state, not by the full history of how it got there.
Markov chains are useful in NLP because they can generate text that resembles a given text. For example, a Markov chain trained on a text about dogs can generate new text about dogs that is similar to, but not identical with, the original. This can be handy for generating creative text, such as poems or stories, or text tailored to a specific audience.
Markov chains can also be used to analyze text: for example, to identify the most common words and phrases, or to estimate the most likely next word in a sequence. That information can feed into NLP tasks such as machine translation and text summarization.
Although the idea is simple, it is a surprisingly versatile tool for generating text, analyzing text, and improving the performance of NLP systems.
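The idea can be sketched with a tiny bigram Markov chain, where the next word depends only on the current word (the corpus and seed below are made up for illustration):

```python
import random
from collections import defaultdict

# Toy corpus; in a bigram Markov chain the next word depends
# only on the current word, not on anything earlier.
corpus = "the dog ran and the dog barked and the cat ran".split()

# Record, for each word, every word that ever followed it.
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

# Generate new text by repeatedly sampling a successor.
random.seed(0)
word = "the"
generated = [word]
for _ in range(5):
    word = random.choice(transitions[word])
    generated.append(word)

print(" ".join(generated))
```

The output resembles the corpus without copying it exactly, which is precisely the behavior described above.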
Define stochastic model
A stochastic model is a mathematical model that incorporates randomness: given the same inputs, it can produce different outputs, with the possibilities described by probability distributions. Stochastic models can be used to make predictions about the future, to analyze historical data, and to design experiments. They are a powerful tool for understanding and predicting complex systems that involve uncertainty.
Here are some examples of stochastic models:
  • A weather forecast that predicts the probability of rain.
  • A stock market prediction that predicts the probability of a stock price going up or down.
  • A medical test that predicts the probability of a patient having a disease.
  • A traffic model that predicts the probability of traffic congestion.
  • A risk assessment that predicts the probability of an accident or disaster.
Stochastic models are used in a wide variety of applications, and they are becoming increasingly important as we try to understand and predict the world around us.
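As a toy illustration, here is a hypothetical stochastic weather model (the probability is made up): each day it rains with probability 0.3, and repeated runs of the model give different outcomes, which is exactly what makes it stochastic rather than deterministic.

```python
import random

def simulate_rainy_days(num_days, p_rain=0.3, rng=None):
    """Count rainy days in one simulated run; each day is an
    independent coin flip with probability p_rain."""
    rng = rng or random.Random()
    return sum(rng.random() < p_rain for _ in range(num_days))

rng = random.Random(42)
# Two runs of the same model give different answers -- that variability
# is the defining feature of a stochastic model.
print(simulate_rainy_days(365, rng=rng))
print(simulate_rainy_days(365, rng=rng))

# Averaging many runs recovers the expected value, near 365 * 0.3 = 109.5.
trials = [simulate_rainy_days(365, rng=rng) for _ in range(1000)]
average = sum(trials) / len(trials)
print(f"average over 1000 runs: {average:.1f}")
```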
How does Numpy make matrix multiplication faster?
NumPy is a highly optimized numerical computing library for Python that provides efficient multidimensional array operations, including matrix multiplication. NumPy achieves faster matrix multiplication primarily through two mechanisms:
  1. Efficient C-based implementation: NumPy's core is implemented in C, so computationally intensive loops run outside the Python interpreter, and NumPy can release the Global Interpreter Lock (GIL) during long computations. The GIL is the mechanism that ensures only one thread executes Python bytecode at a time; by doing the heavy lifting in C, NumPy avoids being limited by it.
  2. Utilization of optimized libraries: NumPy delegates matrix multiplication to optimized, low-level libraries such as BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra Package). These libraries are typically written in lower-level languages like Fortran or C and are designed to perform matrix operations with high performance, including hardware-specific optimizations.
By combining these two strategies, NumPy is able to achieve significant performance gains in matrix multiplication compared to pure Python implementations.
Does Numpy use vectorization
Yes, NumPy utilizes vectorization to optimize performance. Vectorization is the process of executing operations on entire arrays or matrices instead of looping over individual elements. It allows NumPy to take advantage of underlying hardware features like SIMD (Single Instruction, Multiple Data) instructions, which can process multiple data elements in parallel.
When performing operations on NumPy arrays, such as matrix multiplication, addition, or trigonometric functions, the operations are applied element-wise to the entire array without the need for explicit loops. This allows for efficient execution and can significantly speed up computations.
Vectorized operations in NumPy are typically implemented using highly optimized libraries and low-level languages like C or Fortran, which further enhance performance. By utilizing vectorization, NumPy minimizes the overhead associated with Python loops and achieves faster execution times.
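A quick sketch contrasting a pure-Python triple loop with NumPy's vectorized matrix multiplication; both compute the same product, but NumPy does the work in optimized C/BLAS code with no Python-level loops:

```python
import numpy as np

def matmul_loops(a, b):
    """Naive matrix multiply over nested Python lists."""
    n, m, k = len(a), len(b), len(b[0])
    out = [[0.0] * k for _ in range(n)]
    for i in range(n):
        for j in range(k):
            for t in range(m):
                out[i][j] += a[i][t] * b[t][j]
    return out

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 40))
B = rng.standard_normal((40, 30))

slow = np.array(matmul_loops(A.tolist(), B.tolist()))
fast = A @ B                      # vectorized: one call, no Python loops

print(np.allclose(slow, fast))    # True -- same result, far less overhead
```

Timing the two (e.g. with `timeit`) shows the vectorized version winning by orders of magnitude as the matrices grow.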
What are Features?
In the context of machine learning and data analysis, features refer to the individual measurable properties or characteristics of a data point. Features are also known as predictors, variables, attributes, or inputs.
Features provide quantitative or qualitative information about the data points, and they are used to make predictions, classify data, or uncover patterns and relationships in the dataset.
Here are a few key points about features:
  1. Representation of Data: Features are used to represent the data points in a machine learning or statistical model. Each data point is described by a set of features that capture relevant information about that particular instance.
  2. Dimensions of Data: The number of features defines the dimensionality of the dataset. If each data point is described by p features, the dataset can be represented as a p-dimensional space. For example, in a dataset of images, each pixel value could be a feature, leading to a high-dimensional feature space.
  3. Types of Features: Features can take different forms, depending on the nature of the data. They can be numerical (continuous or discrete), categorical (nominal or ordinal), binary (true/false), or even textual. For example, in a dataset of houses, features could include the number of bedrooms (numerical), the neighborhood (categorical), or whether the house has a garden (binary).
  4. Feature Selection and Engineering: Feature selection and feature engineering are important steps in the machine learning pipeline. Feature selection involves identifying the features that contribute the most to the predictive power of the model. Feature engineering involves creating new features or transforming existing ones to improve model performance.
  5. Feature Importance: In many models, such as decision trees or linear regression, feature importance can be determined. It helps to understand which features have the most significant influence on the model's predictions or outcomes.
  6. Feature Vector: In machine learning, features are often represented as a feature vector, which is a one-dimensional array or list that concatenates all the features of a data point. Each element of the feature vector corresponds to a specific feature.
The selection and quality of features play a crucial role in the performance and interpretability of machine learning models. Effective feature representation allows the model to learn patterns and make accurate predictions based on the available information.
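As a small illustration, here is a hypothetical house dataset encoded into feature vectors (the features and encodings below are made up for the example):

```python
# Each data point (house) is described by the same three features:
# a numerical one, a categorical one, and a binary one.
houses = [
    {"bedrooms": 3, "neighborhood": "north", "has_garden": True},
    {"bedrooms": 2, "neighborhood": "south", "has_garden": False},
]

# Encode the categorical feature as an integer id, the binary one as 0/1.
neighborhoods = {"north": 0, "south": 1}

feature_vectors = [
    [h["bedrooms"], neighborhoods[h["neighborhood"]], int(h["has_garden"])]
    for h in houses
]
print(feature_vectors)   # [[3, 0, 1], [2, 1, 0]]
# Two data points, three features each: points in a 3-dimensional
# feature space, stored as a 2x3 table.
```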
I hope this clarifies the concept of features! If you have any further questions, feel free to ask.
When representing words in a matrix, why do we assign an absent word -1 rather than 0?
There are a couple of reasons why we might assign -1 to an absent word in a matrix instead of 0.
  • To distinguish between absent and present words: by giving absent words their own distinct value, we can easily tell them apart from present words. This can be helpful for tasks such as searching for words in a matrix or counting the words it contains.
  • To let absence carry a signal: when features are combined by multiplying and summing, as in a dot product or matrix multiplication, a value of 0 contributes nothing, so an absent word is simply silent. A value of -1 actively subtracts from the result, which means an absent word can vote against a match rather than abstaining.
Ultimately, whether to assign -1 or 0 to absent words in a matrix depends on the specific application.
We find a relation between the last word and all the words that came before it. The combinations that help determine the next word are left unmasked, while the rest are masked out. This masking is called attention, and the useful word combinations are learned via back-propagation.
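The masking idea can be sketched numerically. A common way to implement it (an assumption here, not spelled out in the text above) is to set masked attention scores to a large negative value before the softmax, so their weights collapse to roughly zero:

```python
import numpy as np

# Made-up attention scores for four positions; the mask keeps the
# first two and blocks the last two.
scores = np.array([2.0, 1.0, 0.5, 3.0])
mask = np.array([True, True, False, False])   # False = masked out

# Replace masked scores with a huge negative number, then softmax.
masked_scores = np.where(mask, scores, -1e9)
weights = np.exp(masked_scores - masked_scores.max())
weights /= weights.sum()

print(weights.round(3))   # masked positions get ~0 attention weight
```

Because exp(-1e9) underflows to zero, the masked positions contribute nothing to the weighted sum, exactly as if those word combinations had been removed.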
[Figure: a jagged loss surface (not a smooth gradient, not good for back-propagation) contrasted with a smooth gradient landscape (good for back-propagation)]
Why projecting words from their one-hot representation to an embedded space involves a matrix multiplication?
Projecting words from their one-hot representation to an embedded space involves a matrix multiplication because the embedding process aims to transform the high-dimensional one-hot vectors into lower-dimensional continuous vector representations, also known as word embeddings.
Here's why matrix multiplication is used in this process:
  1. Mapping to a Lower-Dimensional Space: Word embeddings provide a more compact and meaningful representation of words compared to one-hot vectors. These embeddings capture semantic and contextual information about words and enable better handling of language-related tasks.
  2. Transformation Matrix: To project the one-hot vectors into an embedded space, a transformation matrix is used. This matrix serves as a set of learnable weights that determine how each word is represented in the embedded space.
  3. Mapping Relationship: By multiplying the one-hot vector with the transformation matrix, we can effectively map the word's presence or absence to a dense vector representation in the embedded space. The transformation matrix assigns weights to each dimension of the one-hot vector, indicating the significance or relevance of that dimension for the given word.
  4. Reduced Dimensionality: The embedded space typically has a lower dimensionality compared to the original vocabulary size. This dimension reduction helps in capturing meaningful relationships and similarities between words while reducing the computational complexity of subsequent tasks.
Matrix multiplication enables the linear transformation of the one-hot vector into the embedded space. Each element of the resulting embedded vector is computed as the weighted sum of the one-hot vector's elements, with the weights determined by the transformation matrix.
It's important to note that the transformation matrix is typically learned during the training process using techniques like neural networks, specifically in models like Word2Vec, GloVe, or transformer-based architectures.
Overall, the matrix multiplication allows us to project words from the one-hot representation to an embedded space, facilitating more effective language processing and capturing semantic relationships between words.
If you have any further questions or need more clarification, please let me know!
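As a sketch of the point above: multiplying a one-hot vector by an embedding matrix simply selects one row of that matrix. The vocabulary size, embedding dimension, and random weights below are stand-ins for values a model would learn:

```python
import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, embed_dim))  # stand-in for learned weights

# One-hot vector for the third word in the vocabulary.
one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0

# The matrix multiplication projects a 5-D one-hot vector
# down to a 3-D embedding.
embedding = one_hot @ E          # shape (5,) @ (5, 3) -> (3,)

# Because only one element of the one-hot vector is nonzero, the
# weighted sum reduces to picking out row 2 of E.
print(np.allclose(embedding, E[2]))   # True
```

This is why, in practice, frameworks implement embedding layers as a table lookup: the matrix multiply with a one-hot vector and the row lookup are mathematically the same operation.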

Further reading

⚠️Disclaimer: All the screenshots, materials, and other media documents used in this article are copyrighted to the original platform or authors.