Improbable Bigrams: Vulnerabilities in Byte-Level BPE Tokenizers

A study of incomplete tokens in byte-level BPE tokenizers and their vulnerability to improbable bigrams, which trigger hallucinatory behavior in LLMs.

1. Introduction

In large language models (LLMs), tokenization is the bridge between human-readable text and the tokens a model can process. Recent research has exposed significant vulnerabilities in this component, particularly in byte-level BPE tokenizers. This paper examines incomplete tokens (undecodable tokens with stray bytes, produced by byte-level BPE tokenization) and how readily they can be exploited through improbable bigrams.

The root of the vulnerability is that incomplete tokens rely heavily on adjacent tokens to decode correctly. When paired with unfamiliar tokens in out-of-distribution combinations, incomplete tokens become fragile and prone to triggering hallucinatory behavior in LLMs. Our study shows that this vulnerability persists even when the constituent tokens are well trained, which distinguishes it from previously identified glitch-token issues.

90% Reduction

Hallucination reduction in Llama3.1 with alternative tokenization

1.47M Bigrams

Maximum number of incomplete bigrams, found in the Command-R-v01 tokenizer

6 Models

Tested across different LLM families

2. BPE Tokenization Fundamentals

2.1 Byte-Level BPE Implementation

Byte-level BPE extends the conventional BPE algorithm by operating directly on UTF-8 encoded bytes rather than on Unicode characters. Starting from the 256 single-byte symbols, the algorithm repeatedly merges the most frequent adjacent pair of bytes or byte sequences:

$$(x^*, y^*) = \arg\max_{(x,y) \in V \times V} \text{count}(x,y)$$

where $V$ is the current vocabulary and $\text{count}(x,y)$ is the number of times the pair $(x,y)$ occurs adjacently in the training corpus. Because merges are chosen by frequency alone, with no regard for character boundaries, a merged token can begin or end partway through a multi-byte character.
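The merge loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the training code of any particular tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) and the tie-breaking rule are hypothetical, and real implementations add pre-tokenization and much more efficient counting.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the adjacent pair with the highest count, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    # Assumption: ties are broken by comparing the pairs themselves,
    # purely to make this sketch deterministic.
    return max(counts, key=lambda pair: (counts[pair], pair))


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with one token."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(texts, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `texts`."""
    # Every text starts as a sequence of single-byte tokens.
    sequences = [[bytes([b]) for b in text.encode("utf-8")] for text in texts]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


merges, sequences = train_bpe(["你好", "你们"], num_merges=2)
print(merges)
```

On this two-word toy corpus, the first learned merge joins the bytes `0xE4` and `0xBD`, which are only the first two of the three bytes of "你". The resulting token is exactly the kind of partial-character token discussed in the next section; the second merge then completes the character.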

2.2 Definition of Incomplete Tokens

Incomplete tokens are byte-level tokens that cannot be decoded on their own into valid Unicode characters. They contain stray bytes that must be combined with specific adjacent tokens to form a legal UTF-8 sequence. The vulnerability arises because:

  • Incomplete tokens carry no standalone semantic meaning
  • They depend strongly on neighboring tokens
  • Their byte structure makes decoding ambiguous
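The definition above translates into a simple check: a token is incomplete if its raw bytes fail strict UTF-8 decoding. The helper below is a hypothetical sketch (the name `is_incomplete` is not from the paper) and assumes access to each token's underlying byte string.

```python
def is_incomplete(token_bytes: bytes) -> bool:
    """Return True if the bytes are not a valid UTF-8 sequence on their own."""
    try:
        token_bytes.decode("utf-8")  # strict decoding raises on stray bytes
        return False
    except UnicodeDecodeError:
        return True


prefix = b"\xe4\xbd"  # first two bytes of the three-byte character "你"
suffix = b"\xa0"      # its final continuation byte

print(is_incomplete(prefix))           # truncated multi-byte sequence
print(is_incomplete(suffix))           # lone continuation byte
print(is_incomplete(prefix + suffix))  # together they form "你"
```

Each fragment is incomplete by itself, while their concatenation decodes cleanly. This is the dependence on neighboring tokens that the list above describes.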

3. Improbable Bigram Methodology

3.1 Construction Technique

Improbable bigrams are carefully constructed combinations of two incomplete tokens that form out-of-distribution pairs. Construction follows these principles:

  1. Select incomplete tokens from the tokenizer's vocabulary
  2. Ensure that the combination yields a valid UTF-8 byte sequence
  3. Maximize the statistical improbability of the pairing
  4. Verify that the bigram does not appear in the training data
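The four steps above can be sketched on a toy vocabulary. Everything here is an illustrative assumption rather than the paper's pipeline: the vocabulary is invented, `seen_pairs` stands in for training-corpus statistics that are usually not public, and step 3 is reduced to simply excluding pairs known to occur.

```python
def is_incomplete(token_bytes: bytes) -> bool:
    """Return True if the bytes are not a valid UTF-8 sequence on their own."""
    try:
        token_bytes.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def candidate_bigrams(vocab, seen_pairs=frozenset()):
    """List ordered pairs of incomplete tokens that concatenate to valid UTF-8."""
    incomplete = [tok for tok in vocab if is_incomplete(tok)]  # step 1
    candidates = []
    for first in incomplete:
        for second in incomplete:
            if is_incomplete(first + second):  # step 2: must decode together
                continue
            if (first, second) in seen_pairs:  # steps 3-4, crudely approximated
                continue
            candidates.append((first, second))
    return candidates


# Hypothetical toy vocabulary: fragments of "你" (E4 BD A0) and "好" (E5 A5 BD).
vocab = [b"\xe4\xbd", b"\xa0", b"\xe5\xa5", b"\xbd", b"a", b"\xe4\xbd\xa0"]
# Pairs assumed to be common in training data because they spell real characters.
seen = {(b"\xe4\xbd", b"\xa0"), (b"\xe5\xa5", b"\xbd")}

for first, second in candidate_bigrams(vocab, seen):
    print(first, second, (first + second).decode("utf-8"))
```

The surviving pairs cross the prefix of one character with the suffix of the other. They still decode to legal characters, but the two tokens are unlikely to have appeared side by side during training. A real pipeline would additionally need to confirm that the tokenizer re-encodes the decoded string into the same two tokens.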

3.2 Vulnerability Analysis

The vulnerability mechanism operates through the following main channels:

Decoding Ambiguity: Incomplete tokens create decoding uncertainty that propagates through the model's layers. Formally, the embedding vectors $e_i$ of incomplete tokens exhibit higher variance than the embeddings $e_j$ of complete tokens:

$$\text{Var}(e_i \mid \text{incomplete}) > \text{Var}(e_j \mid \text{complete})$$

Contextual Fragility: The dependency structure makes these tokens fragile when they are removed from their expected context, similar to the instability observed with adversarial examples in computer vision research.

4. Experimental Results

4.1 Hallucination Rates

Our experiments across different LLM families revealed striking differences in hallucination rates between the standard tokenization and an alternative tokenization of the same phrases:

| Model | Standard Tokenization | Alternative Tokenization | Reduction |
|---|---|---|---|
| Llama3.1 | 45.2% | 4.5% | 90.0% |
| Qwen2.5 | 38.7% | 6.2% | 84.0% |
| Mistral-Nemo | 52.1% | 8.9% | 82.9% |
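The Reduction column is the relative drop in hallucination rate, $100 \cdot (r_{\text{std}} - r_{\text{alt}}) / r_{\text{std}}$, not a difference in percentage points. A short check, using only the rates reported above (the function name is illustrative):

```python
def relative_reduction(standard_pct: float, alternative_pct: float) -> float:
    """Relative reduction, in percent, from the standard to the alternative rate."""
    return 100.0 * (standard_pct - alternative_pct) / standard_pct


# (standard, alternative) hallucination rates in percent, as reported in the table.
rates = {
    "Llama3.1": (45.2, 4.5),
    "Qwen2.5": (38.7, 6.2),
    "Mistral-Nemo": (52.1, 8.9),
}

for model, (standard, alternative) in rates.items():
    print(f"{model}: {relative_reduction(standard, alternative):.1f}%")
```

The computed values round to 90.0%, 84.0% and 82.9%, matching the table.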

4.2 Model Architecture Comparison

Vulnerability metrics vary substantially across tokenizers, as our full analysis shows:

| Tokenizer | Vocabulary Size | Incomplete Tokens | Incomplete Bigrams |
|---|---|---|---|
| Meta-Llama-3.1 | 128k | 1,224 | 71k |
| Exaone-3.0 | 102k | 1,222 | 36k |
| Qwen2.5 | 151k | 1,320 | 39k |
| Command-R-v01 | 255k | 2,956 | 1.47M |
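To put the table in perspective, the sketch below compares Command-R-v01 with Meta-Llama-3.1 column by column. It uses only the figures above; the "k" and "M" entries are rounded, so the ratios are approximate, and the names `STATS` and `growth_factors` are illustrative.

```python
# Figures copied from the table; "k" and "M" values are rounded in the source.
STATS = {
    "Meta-Llama-3.1": {"vocab": 128_000, "incomplete_tokens": 1_224, "incomplete_bigrams": 71_000},
    "Exaone-3.0": {"vocab": 102_000, "incomplete_tokens": 1_222, "incomplete_bigrams": 36_000},
    "Qwen2.5": {"vocab": 151_000, "incomplete_tokens": 1_320, "incomplete_bigrams": 39_000},
    "Command-R-v01": {"vocab": 255_000, "incomplete_tokens": 2_956, "incomplete_bigrams": 1_470_000},
}


def growth_factors(base: str, other: str) -> dict:
    """Ratio of each statistic of `other` to the same statistic of `base`."""
    base_row, other_row = STATS[base], STATS[other]
    return {key: other_row[key] / base_row[key] for key in base_row}


for key, factor in growth_factors("Meta-Llama-3.1", "Command-R-v01").items():
    print(f"{key}: {factor:.1f}x")
```

Command-R-v01 has roughly 2.0x the vocabulary and 2.4x the incomplete tokens of Meta-Llama-3.1, but about 20.7x the incomplete bigrams. The number of candidate pairs grows much faster than the vocabulary itself.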

5. Technical Analysis Framework

Core Insight

Byte-level BPE tokenization, while computationally efficient, introduces fundamental architectural weaknesses that create systematic blind spots in LLMs. This is not merely an implementation bug; it is a structural flaw in how modern tokenizers handle the complexity of Unicode.

Logical Flow

The vulnerability follows a predictable chain: byte-level segmentation → creation of incomplete tokens → emergence of contextual dependencies → exploitation of statistical improbability → triggering of hallucinations. This chain shows that tokenization is not just a preprocessing step; it is a critical security layer.

Strengths & Flaws

Strengths: The research methodology is rigorous, with validation across different models and metrics. The improbable-bigram concept provides a concrete attack vector for testing tokenizer robustness.

Flaws: The paper underemphasizes the training-data contamination angle. Many "improbable" combinations may in fact reflect legitimate edge cases in the writing systems of different languages rather than pure artifacts.

Actionable Insights

LLM developers must treat tokenizers as critical security components, not just as preprocessing utilities. Implement runtime tokenizer sanity checks, adopt hybrid tokenization approaches, and run adversarial tests that specifically target combinations of incomplete tokens.
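A runtime sanity check of the kind recommended above could look like the following sketch. It assumes the caller can obtain the raw byte string of every token in an encoded input; how to do that is tokenizer-specific and not shown. The function names and the `known_pairs` allow-list are hypothetical.

```python
def is_incomplete(token_bytes: bytes) -> bool:
    """Return True if the bytes are not a valid UTF-8 sequence on their own."""
    try:
        token_bytes.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


def flag_suspicious_pairs(tokens, known_pairs=frozenset()):
    """Return (index, first, second) for adjacent incomplete tokens not on the allow-list."""
    flagged = []
    for i in range(len(tokens) - 1):
        first, second = tokens[i], tokens[i + 1]
        if is_incomplete(first) and is_incomplete(second) and (first, second) not in known_pairs:
            flagged.append((i, first, second))
    return flagged


# Hypothetical tokenization of an input containing a crossed-fragment bigram.
tokens = [b"Hi ", b"\xe4\xbd", b"\xbd", b"!"]
print(flag_suspicious_pairs(tokens))
```

A flagged input could then be re-tokenized along character boundaries or routed for review. The allow-list keeps the check from firing on fragment pairs that are known to be common in ordinary text.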

Original Analysis: A Tokenization Security Paradigm

This research changes how we should think about tokenizers in the LLM security landscape. The findings show that byte-level BPE tokenizers create systematic vulnerabilities that transcend any single model architecture, reminiscent of the fundamental weaknesses found in early cryptographic systems. Unlike the well-documented glitch-token issues, which mainly affect under-trained tokens, the incomplete-token vulnerability persists even when the tokens involved are well trained, which points to a deeper architectural problem.

The 90% reduction in hallucination rate when the same input phrases are tokenized differently is particularly damning. An improvement of this size suggests that current byte-level BPE implementations inject substantial noise into the model's processing pipeline. Set against the adversarial-robustness literature in computer vision, where comparable architectural weaknesses have been studied extensively, the tokenization layer emerges as the counterpart of decision-boundary fragility in image classifiers.

What makes this research especially compelling is its connection to broader Unicode security problems. The Unicode Consortium has long warned about confusables and normalization weaknesses, and this work extends those concerns into the domain of neural architectures. The finding that Command-R-v01's larger vocabulary goes hand in hand with far more incomplete bigrams (1.47M vs. 71k in Llama3.1) suggests that growing the vocabulary without addressing the underlying problem can enlarge the attack surface.

Looking ahead, this research should prompt a shift toward "security-first tokenization", much as the cryptography community embraced provably secure primitives. The alternative tokenization strategies that sharply reduce hallucination point toward hybrid approaches that combine the efficiency of byte-level BPE with the robustness of character-level or word-piece methods. As LLMs are increasingly deployed in safety-critical applications, addressing these tokenization-level weaknesses is no longer just an academic concern but a practical necessity.

6. Future Directions & Applications

Security Applications

  • Robust Tokenization Standards: Develop tokenization methods that minimize incomplete tokens while preserving efficiency
  • Adversarial Testing Frameworks: Automated systems for detecting tokenizer vulnerabilities during model development
  • Runtime Monitoring: Detect and mitigate improbable-bigram attacks in production systems

Research Opportunities

  • Cross-lingual analysis of how incomplete tokens are distributed across languages
  • Integration with retrieval-augmented generation to reduce contextual fragility
  • Development of formal verification methods for tokenizer security properties

Industry Impact

The research has immediate implications for:

  • LLM safety evaluation benchmarks
  • Tokenizer design in next-generation models
  • Regulatory frameworks for AI safety systems

7. References

  1. Jang, E., Lee, K., Chung, J.-W., Park, K., & Shin, S. (2025). Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers. arXiv:2410.23684v2
  2. Rumbelow, J., & Watkins, M. (2023). SolidGoldMagikarp: An analysis of glitch tokens in large language models.
  3. Land, K., & Bartolo, A. (2024). Embedding layer heuristics for identifying glitch tokens.
  4. Wang, X., et al. (2024). Adversarial questions through tokenizer segmentation attacks.
  5. Petrov, A., et al. (2023). Tokenization fairness in multilingual models.
  6. Geiping, J., et al. (2024). Jailbreaking through token manipulation.
  7. Unicode Consortium. (2024). Unicode Security Considerations. Unicode Technical Report #36
  8. Vaswani, A., et al. (2017). Attention is All You Need. NeurIPS 2017