# ai-reading-club
acceptable-knife-37130
I have a question for the group about BPE tokenizers. It appears that tiktoken (OpenAI's BPE tokenizer) is multilingual: I tried it with multiple languages and it generated tokens for all of them. But when I look at the vocabulary list, I don't see words/tokens for the languages I want tokenized, and yet it is able to encode and decode them. I am attaching the vocab lists for both GPT-2 and o200k_base for reference. The o200k_base vocabulary has some foreign-language tokens, but not for the language I am trying to tokenize; yet both encodings can encode/decode the text. Try tokenizing your native language (other than English) with the code below:

tokenizer = tiktoken.get_encoding("gpt2")        # GPT-2
tokenizer = tiktoken.get_encoding("o200k_base")  # GPT-4o / o1

Is anyone aware of this, and how is it achieved?
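A fuller, runnable version of that experiment, as a sketch (assumes only `pip install tiktoken`; the Hindi sample string is arbitrary):

```python
# Sketch: tokenize a language that has no dedicated tokens in the vocab.
# Assumes `pip install tiktoken`; the Hindi sample text is arbitrary.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "नमस्ते दुनिया"  # Hindi for "hello, world"
ids = enc.encode(text)

# Encoding never fails, even for scripts absent from the merge list --
# the text just splits into many short tokens instead of whole words.
print(len(text), "characters ->", len(ids), "tokens")

# And decoding round-trips losslessly back to the original string.
print(enc.decode(ids) == text)  # True
```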
average-finland-92144
@acceptable-knife-37130 This might be a good reference: https://arxiv.org/abs/1508.07909 (Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units", the original subword-BPE paper). The short answer: GPT-2-style tokenizers apply BPE over UTF-8 bytes, so any string can always be encoded byte-by-byte, even when the vocabulary has no merged tokens for that language; it just takes more, shorter tokens.
acceptable-knife-37130
Thanks @average-finland-92144! I managed to create a custom BPE tokenizer with 72,344 tokens, trained on the Sherlock Holmes novels. One observation: for the same text, my custom tokenizer's token count is consistently lower on average than tiktoken's (for both the GPT-2 and o200k_base/o1 encodings). Not sure if this will help me when I start training my model. Will keep you all posted!
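In case it's useful to anyone, here is a minimal sketch of the training-and-comparison setup (assumes `pip install tokenizers tiktoken`; the corpus file name `sherlock_holmes.txt` and the sample sentence are illustrative, not my exact setup):

```python
# Sketch: train a byte-level BPE tokenizer on a local corpus, then compare
# token counts against tiktoken's general-purpose encodings.
# Assumes `pip install tokenizers tiktoken` and a local "sherlock_holmes.txt".
import tiktoken
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Train a custom byte-level BPE vocabulary (72,344 tokens, as above).
custom = Tokenizer(BPE())
custom.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(vocab_size=72344)
custom.train(["sherlock_holmes.txt"], trainer)

# Compare token counts on in-domain text.
sample = "Mr. Sherlock Holmes, who was usually very late in the mornings..."
for name in ("gpt2", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, "->", len(enc.encode(sample)), "tokens")
print("custom ->", len(custom.encode(sample).ids), "tokens")
```

One likely explanation for the lower counts: a corpus-specific vocabulary spends its merges on that corpus's most frequent substrings, so in-domain text compresses into fewer tokens than a general-purpose encoding produces, while off-domain text will typically need more.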