# ai-reading-club
acceptable-knife-37130
I have a question for the group about BPE tokenizers. It appears that tiktoken (OpenAI's BPE tokenizer) is multilingual: I tried it with multiple languages and it generated tokens for all of them. But when I look at the vocabulary list, I don't see words/tokens for the languages I want tokenized, and yet it is able to encode and decode them. I am attaching the vocab lists for both GPT-2 and o200k_base for reference. The o200k_base vocabulary has some foreign-language tokens, but not for the language I am trying to tokenize; yet both encodings can encode/decode the text. Try tokenizing your native language (other than English) with the code below:

tokenizer = tiktoken.get_encoding("gpt2")        # GPT-2
tokenizer = tiktoken.get_encoding("o200k_base")  # GPT-4o / o1

Is anyone aware of this, and how is it achieved?
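A fuller, runnable version of that experiment, as a sketch (assumes only `pip install tiktoken`; the Hindi sample string is arbitrary):

```python
# Sketch: tokenize a language that has no dedicated tokens in the vocab.
# Assumes `pip install tiktoken`; the Hindi sample text is arbitrary.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

text = "नमस्ते दुनिया"  # Hindi for "hello, world"
ids = enc.encode(text)

# Encoding never fails, even for scripts absent from the merge list --
# the text just splits into many short tokens instead of whole words.
print(len(text), "characters ->", len(ids), "tokens")

# And decoding round-trips losslessly back to the original string.
print(enc.decode(ids) == text)  # True
```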
average-finland-92144
@acceptable-knife-37130 This might be a good reference: https://arxiv.org/abs/1508.07909 (Sennrich et al., "Neural Machine Translation of Rare Words with Subword Units", the original subword-BPE paper). The short answer: GPT-2-style tokenizers apply BPE over UTF-8 bytes, so any string can always be encoded byte-by-byte, even when the vocabulary has no merged tokens for that language; it just takes more, shorter tokens.
acceptable-knife-37130
Thanks @average-finland-92144! I managed to create a custom BPE tokenizer with 72,344 tokens, trained on the Sherlock Holmes novels. One observation: for the same text, my custom tokenizer's token count is consistently lower on average than tiktoken's (for both the GPT-2 and o200k_base/o1 encodings). Not sure if this will help me when I start training my model. Will keep you all posted!
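In case it's useful to anyone, here is a minimal sketch of the training-and-comparison setup (assumes `pip install tokenizers tiktoken`; the corpus file name `sherlock_holmes.txt` and the sample sentence are illustrative, not my exact setup):

```python
# Sketch: train a byte-level BPE tokenizer on a local corpus, then compare
# token counts against tiktoken's general-purpose encodings.
# Assumes `pip install tokenizers tiktoken` and a local "sherlock_holmes.txt".
import tiktoken
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

# Train a custom byte-level BPE vocabulary (72,344 tokens, as above).
custom = Tokenizer(BPE())
custom.pre_tokenizer = ByteLevel()
trainer = BpeTrainer(vocab_size=72344)
custom.train(["sherlock_holmes.txt"], trainer)

# Compare token counts on in-domain text.
sample = "Mr. Sherlock Holmes, who was usually very late in the mornings..."
for name in ("gpt2", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, "->", len(enc.encode(sample)), "tokens")
print("custom ->", len(custom.encode(sample).ids), "tokens")
```

One likely explanation for the lower counts: a corpus-specific vocabulary spends its merges on that corpus's most frequent substrings, so in-domain text compresses into fewer tokens than a general-purpose encoding produces, while off-domain text will typically need more.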