# ai-reading-club
Hi, I have a question related to the *Build a Large Language Model (From Scratch)* book by Sebastian Raschka. The discussion happened 2 months back, but I have a few questions, if anyone is aware and can help me out. In the chapter where we code attention, in the MultiHeadAttention section (section 3.6.2), the example code is:
```python
context_vec = context_vec.contiguous().view(b, num_tokens, self.d_out)
context_vec = self.out_proj(context_vec)
```
Here are my questions:

a) Why do I need the contiguous() call before merging the dimensions? I can write it as below and get the same result. Is there an issue if I do it this way?

```python
context_vec = context_vec.view(b, num_tokens, self.d_out)
```
b) I don't understand the use of the output projection linear layer. The book says we do it because GPT uses it, without explaining why it is needed:

```python
context_vec = self.out_proj(context_vec)
```
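For reference, the layer in question is just nn.Linear(d_out, d_out); here is a minimal sketch of its shape behavior (dimensions again arbitrary stand-ins):

```python
import torch
import torch.nn as nn

b, num_tokens, d_out = 2, 6, 32
out_proj = nn.Linear(d_out, d_out)  # what the book assigns to self.out_proj

merged_heads = torch.randn(b, num_tokens, d_out)  # concatenated head outputs
out = out_proj(merged_heads)
print(out.shape)  # torch.Size([2, 6, 32]): shape is unchanged

# Note: before this projection, each block of head_dim channels comes from a
# single head; the Linear layer lets every output channel mix all heads.
```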