Top 10 most common languages in the Common Crawl dataset, as reported by ChatGPT, based on the primary language detected in HTML pages (the figures match crawl CC-MAIN-2025-05):

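| Language Code | Language Name | Percentage (%) |
|---|---|---|
| eng | English | 43.37 |
| rus | Russian | 6.05 |
| deu | German | 5.59 |
| jpn | Japanese | 5.40 |
| zho | Chinese | 4.67 |
| spa | Spanish | 4.64 |
| fra | French | 4.52 |
| ita | Italian | 2.42 |
| por | Portuguese | 2.33 |
| nld | Dutch | 1.92 |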

Further detail is given below as the percentage of pages per primary language in each crawl. The latest crawl shown is CC-MAIN-2025-13 (via ChatGPT):

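| language | CC-MAIN-2025-05 | CC-MAIN-2025-08 | CC-MAIN-2025-13 |
|---|---|---|---|
| eng | 43.3694 | 43.3794 | 43.9338 |
| rus | 6.0460 | 6.0982 | 5.9294 |
| deu | 5.5857 | 5.4039 | 5.1154 |
| jpn | 5.4046 | 5.1641 | 4.8624 |
| zho | 4.6719 | 5.0942 | 5.4135 |
| spa | 4.6417 | 4.5461 | 4.4426 |
| fra | 4.5221 | 4.2950 | 4.2269 |
| <unknown> | 2.6949 | 3.0224 | 3.0027 |
| ita | 2.4152 | 2.3997 | 2.4181 |
| por | 2.3259 | 2.3274 | 2.3679 |
| nld | 1.9167 | 1.7941 | 1.7982 |
| pol | 1.8248 | 1.7470 | 1.6923 |
| tur | 1.1750 | 1.1411 | 1.1405 |
| ind | 1.1676 | 1.1845 | 1.2009 |
| vie | 1.0427 | 1.0551 | 1.0498 |
| ces | 1.0225 | 1.0083 | 1.0380 |
| fas | 0.7250 | 0.7274 | 0.7350 |
| kor | 0.7183 | 0.8078 | 0.8028 |
| ara | 0.6640 | 0.6856 | 0.6837 |
| swe | 0.6547 | 0.6415 | 0.6621 |
| ron | 0.6189 | 0.6304 | 0.6522 |
| ukr | 0.6046 | 0.6075 | 0.6240 |
| ell | 0.5740 | 0.5833 | 0.5902 |
| hun | 0.5023 | 0.5021 | 0.4990 |
| dan | 0.4850 | 0.4671 | 0.4630 |
| tha | 0.4554 | 0.5037 | 0.4533 |
| fin | 0.3678 | 0.3563 | 0.3616 |
| slk | 0.3617 | 0.3568 | 0.3684 |
| nor | 0.3142 | 0.3042 | 0.3119 |
| bul | 0.2985 | 0.3103 | 0.3027 |
| heb | 0.2701 | 0.2563 | 0.2571 |
| hrv | 0.2238 | 0.2258 | 0.2295 |
| srp | 0.2174 | 0.2152 | 0.2157 |
| hin | 0.1951 | 0.2040 | 0.2027 |
| cat | 0.1876 | 0.1849 | 0.1767 |
| lit | 0.1659 | 0.1627 | 0.1659 |
| slv | 0.1504 | 0.1509 | 0.1552 |
| est | 0.1345 | 0.1306 | 0.1308 |
| lat | 0.1195 | 0.1060 | 0.1061 |
| ben | 0.1089 | 0.1147 | 0.1107 |
| lav | 0.0888 | 0.0878 | 0.0854 |
| msa | 0.0781 | 0.0785 | 0.0774 |
| aze | 0.0572 | 0.0630 | 0.0612 |
| bos | 0.0520 | 0.0571 | 0.0538 |
| nep | 0.0519 | 0.0552 | 0.0553 |
| sqi | 0.0452 | 0.0487 | 0.0492 |
| isl | 0.0423 | 0.0389 | 0.0388 |
| tam | 0.0422 | 0.0457 | 0.0437 |
| kat | 0.0396 | 0.0415 | 0.0444 |
| mkd | 0.0361 | 0.0380 | 0.0387 |
| eus | 0.0347 | 0.0346 | 0.0328 |
| glg | 0.0320 | 0.0297 | 0.0305 |
| hye | 0.0318 | 0.0327 | 0.0328 |
| urd | 0.0276 | 0.0303 | 0.0300 |
| kaz | 0.0272 | 0.0298 | 0.0318 |
| mar | 0.0258 | 0.0278 | 0.0290 |
| mal | 0.0228 | 0.0251 | 0.0253 |
| uzb | 0.0199 | 0.0226 | 0.0228 |
| tel | 0.0180 | 0.0198 | 0.0195 |
| nno | 0.0162 | 0.0167 | 0.0135 |
| mon | 0.0146 | 0.0162 | 0.0153 |
| bel | 0.0143 | 0.0149 | 0.0165 |
| mya | 0.0129 | 0.0137 | 0.0148 |
| kan | 0.0125 | 0.0135 | 0.0138 |
| guj | 0.0120 | 0.0124 | 0.0125 |
| cym | 0.0109 | 0.0107 | 0.0120 |
| khm | 0.0104 | 0.0109 | 0.0106 |
| sin | 0.0098 | 0.0094 | 0.0092 |
| kir | 0.0097 | 0.0091 | 0.0111 |
| afr | 0.0087 | 0.0099 | 0.0095 |
| swa | 0.0085 | 0.0083 | 0.0095 |
| tgl | 0.0078 | 0.0083 | 0.0087 |
| epo | 0.0074 | 0.0074 | 0.0085 |
| pan | 0.0069 | 0.0070 | 0.0068 |
| tat | 0.0069 | 0.0081 | 0.0082 |
| gle | 0.0066 | 0.0074 | 0.0074 |
| kur | 0.0063 | 0.0068 | 0.0069 |
| tgk | 0.0063 | 0.0065 | 0.0069 |
| ori | 0.0056 | 0.0060 | 0.0061 |
| som | 0.0046 | 0.0052 | 0.0048 |
| fao | 0.0045 | 0.0043 | 0.0044 |
| ltz | 0.0038 | 0.0038 | 0.0041 |
| oci | 0.0034 | 0.0033 | 0.0037 |
| lao | 0.0033 | 0.0040 | 0.0039 |
| mlt | 0.0030 | 0.0032 | 0.0032 |
| pus | 0.0030 | 0.0031 | 0.0041 |
| san | 0.0030 | 0.0032 | 0.0032 |
| amh | 0.0028 | 0.0031 | 0.0033 |
| mlg | 0.0028 | 0.0035 | 0.0041 |
| bak | 0.0026 | 0.0029 | 0.0032 |
| hau | 0.0026 | 0.0027 | 0.0032 |
| asm | 0.0025 | 0.0025 | 0.0023 |
| div | 0.0022 | 0.0025 | 0.0024 |
| jav | 0.0022 | 0.0023 | 0.0023 |
| fry | 0.0020 | 0.0027 | 0.0022 |
| cos | 0.0019 | 0.0018 | 0.0021 |
| kin | 0.0019 | 0.0021 | 0.0023 |
| war | 0.0019 | 0.0023 | 0.0021 |
| bre | 0.0018 | 0.0029 | 0.0023 |
| hat | 0.0018 | 0.0020 | 0.0019 |
| tuk | 0.0017 | 0.0020 | 0.0022 |
| ceb | 0.0015 | 0.0017 | 0.0018 |
| bod | 0.0014 | 0.0015 | 0.0015 |
| snd | 0.0014 | 0.0016 | 0.0016 |
| yid | 0.0014 | 0.0017 | 0.0015 |
| gla | 0.0013 | 0.0013 | 0.0013 |
| mri | 0.0013 | 0.0013 | 0.0013 |
| zul | 0.0012 | 0.0013 | 0.0011 |
| roh | 0.0010 | 0.0010 | 0.0010 |
| sun | 0.0010 | 0.0010 | 0.0012 |
| uig | 0.0010 | 0.0012 | 0.0012 |
| yor | 0.0009 | 0.0008 | 0.0008 |
| kal | 0.0008 | 0.0008 | 0.0008 |
| tir | 0.0008 | 0.0007 | 0.0007 |
| xho | 0.0008 | 0.0008 | 0.0007 |
| grn | 0.0007 | 0.0007 | 0.0007 |
| sna | 0.0007 | 0.0006 | 0.0007 |
| haw | 0.0006 | 0.0006 | 0.0006 |
| ibo | 0.0006 | 0.0006 | 0.0006 |
| ina | 0.0006 | 0.0005 | 0.0005 |
| orm | 0.0006 | 0.0005 | 0.0007 |
| que | 0.0006 | 0.0005 | 0.0005 |
| sag | 0.0006 | 0.0001 | 0.0007 |
| smo | 0.0006 | 0.0006 | 0.0006 |
| sot | 0.0006 | 0.0006 | 0.0005 |
| abk | 0.0005 | 0.0006 | 0.0005 |
| bih | 0.0005 | 0.0005 | 0.0005 |
| glv | 0.0005 | 0.0004 | 0.0009 |
| hmn | 0.0005 | 0.0006 | 0.0006 |
| nya | 0.0005 | 0.0005 | 0.0005 |
| sco | 0.0005 | 0.0006 | 0.0006 |
| vol | 0.0005 | 0.0005 | 0.0005 |
| ile | 0.0004 | 0.0003 | 0.0003 |
| kha | 0.0004 | 0.0004 | 0.0003 |
| lin | 0.0003 | 0.0003 | 0.0003 |
| syr | 0.0003 | 0.0004 | 0.0004 |
| dzo | 0.0002 | 0.0002 | 0.0002 |
| iku | 0.0002 | 0.0002 | 0.0002 |
| lug | 0.0002 | 0.0002 | 0.0001 |
| aar | 0.0001 | 0.0001 | 0.0001 |
| aka | 0.0001 | 0.0002 | 0.0001 |
| aym | 0.0001 | 0.0000 | 0.0001 |
| bis | 0.0001 | 0.0001 | 0.0001 |
| crs | 0.0001 | 0.0001 | 0.0001 |
| fij | 0.0001 | 0.0001 | 0.0001 |
| ipk | 0.0001 | 0.0001 | 0.0001 |
| mfe | 0.0001 | 0.0001 | 0.0001 |
| nso | 0.0001 | 0.0001 | 0.0001 |
| ton | 0.0001 | 0.0001 | 0.0001 |
| tsn | 0.0001 | 0.0001 | 0.0001 |
| tso | 0.0001 | 0.0001 | 0.0001 |
| ven | 0.0001 | 0.0000 | 0.0000 |
| wol | 0.0001 | 0.0001 | 0.0001 |
| zha | 0.0001 | 0.0002 | 0.0001 |
| chr | 0.0000 | 0.0000 | 0.0000 |
| got | 0.0000 | 0.0000 | 0.0000 |
| kas | 0.0000 | 0.0000 | 0.0000 |
| lif | 0.0000 | 0.0000 | 0.0000 |
| nau | 0.0000 | 0.0000 | 0.0000 |
| run | 0.0000 | 0.0001 | 0.0001 |
| ssw | 0.0000 | 0.0000 | 0.0000 |
| sux | 0.0000 | 0.0000 | 0.0000 |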
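
As a rough illustration of how the per-crawl figures above can be ranked and compared, here is a minimal Python sketch. It hard-codes only the twelve largest rows from the table; the names `SHARES`, `CRAWLS`, `top_languages`, and `change` are hypothetical helpers introduced for this example and are not part of any Common Crawl tooling. Skipping the `<unknown>` bucket when ranking is also an assumption of the sketch.

```python
# Percentages copied from the table above (twelve largest rows only).
# Tuple order follows CRAWLS: oldest crawl first, latest crawl last.
CRAWLS = ("CC-MAIN-2025-05", "CC-MAIN-2025-08", "CC-MAIN-2025-13")

SHARES = {
    "eng": (43.3694, 43.3794, 43.9338),
    "rus": (6.0460, 6.0982, 5.9294),
    "deu": (5.5857, 5.4039, 5.1154),
    "jpn": (5.4046, 5.1641, 4.8624),
    "zho": (4.6719, 5.0942, 5.4135),
    "spa": (4.6417, 4.5461, 4.4426),
    "fra": (4.5221, 4.2950, 4.2269),
    "<unknown>": (2.6949, 3.0224, 3.0027),
    "ita": (2.4152, 2.3997, 2.4181),
    "por": (2.3259, 2.3274, 2.3679),
    "nld": (1.9167, 1.7941, 1.7982),
    "pol": (1.8248, 1.7470, 1.6923),
}


def top_languages(shares, crawl, n=10, skip=("<unknown>",)):
    """Return the n largest (code, percentage) pairs for one crawl.

    Codes listed in `skip` are left out (assumption: the undetected
    bucket is not counted as a language).
    """
    col = CRAWLS.index(crawl)
    ranked = sorted(
        ((code, vals[col]) for code, vals in shares.items() if code not in skip),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:n]


def change(shares, code, start=CRAWLS[0], end=CRAWLS[-1]):
    """Percentage-point change for one language between two crawls."""
    vals = shares[code]
    return round(vals[CRAWLS.index(end)] - vals[CRAWLS.index(start)], 4)


if __name__ == "__main__":
    for code, pct in top_languages(SHARES, "CC-MAIN-2025-13"):
        print(f"{code}\t{pct:.2f}\t{change(SHARES, code):+.4f}")
```

Because the thirteenth-largest row in the table is already well below the twelfth in every crawl, the twelve rows kept here are enough to determine the top 10; extending `SHARES` with the remaining rows would not change the ranking.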

Reference: https://commoncrawl.github.io/cc-crawl-statistics/plots/languages.html?utm_source=chatgpt.com