Searching does not work properly for CJK ideographs #1004
Comments
You have found a client-side search, which is enabled only for secret chat messages. The best way to improve it is to contribute directly to SQLite's FTS extension.
OK, I see. So there is no way to check out Telegram's server-side code?
No.
Telegram's search is word-based, and words are delimited by punctuation or spaces.
Any updates on this? Not being able to search CJK characters effectively is a huge pain when using Telegram.
This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it. Why is that? Is it because few Telegram users write in CJK languages, or because it is technically hard to achieve?
Actually, I have found lots of CJK users on Telegram. But the search issue is keeping that number from growing.
Still waiting for a fix... this is important.
I used to be eager for this feature. I have since switched to Discord with my friends.
This feature is important to me.
They think their language is the entire world. What a narrow prejudice!
I agree with you.
On Thu, 21 Nov 2024 at 09:42, Platway ***@***.***> wrote:
… They think their language is the entire world. What a narrow prejudice!
Hello,
Telegram's search works poorly with Chinese, Japanese and Korean (CJK) ideographs, which makes it hard to promote in Taiwan.
I tried to find out the cause. I took a look at MessagesDb.cpp and found that Telegram uses SQLite to store messages and the FTS5 module to build a search table.
And that is the point. FTS5 splits a string into tokens and puts them into a hash table. Suppose the text is "Telegram search". Only "Telegram" and "search" would match it, whereas "Tele" or "a" would get no result. Unfortunately, Chinese characters are all categorized as "Letter", so they are treated as token characters. Hence a whole Chinese text like "我好想要中文搜尋", which consists of consecutive Chinese characters without any delimiter, is treated as a single token. That is, none of "想要", "中文" or "搜尋" would match it.
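To make the behavior above concrete, here is a rough Python model of a tokenizer that treats letters and digits as token characters. It is only an illustration under my own assumptions, not the actual FTS5 or Telegram code; the names `tokenize` and `matches` are hypothetical.

```python
# Rough model of a letter/digit-based tokenizer.
# Assumption: this only imitates the behavior described above; it is NOT FTS5.

def tokenize(text):
    """Split text into lowercase tokens made of consecutive letters or digits."""
    tokens, current = [], []
    for ch in text:
        if ch.isalnum():          # CJK ideographs count as letters here
            current.append(ch.lower())
        elif current:             # any other character ends the current token
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def matches(query, text):
    """True only if the query equals one whole token of the text."""
    return query.lower() in tokenize(text)


print(tokenize("Telegram search"))        # two tokens
print(tokenize("我好想要中文搜尋"))          # one single token
print(matches("中文", "我好想要中文搜尋"))   # False
```

In this model the Chinese sentence comes out as one token, so a query for any shorter substring cannot match it.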
I have two ideas. The simple one is to insert an invisible separator such as '\a' between every pair of adjacent Chinese characters. The other is to implement a custom tokenizer.
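A minimal sketch of the first idea, in Python for brevity. It assumes the same separation is applied both to the stored text and to the query, and that the query is then matched as a sequence of consecutive tokens. A space is used as the separator instead of '\a' to keep the example readable, the character ranges cover only the basic CJK ideograph blocks, and all names (`separate_cjk`, `phrase_match`) are hypothetical.

```python
import re

# Assumption: only CJK Unified Ideographs and Extension A are handled;
# kana, hangul and the rarer extension blocks are left out of this sketch.
CJK = re.compile(r"([\u3400-\u4dbf\u4e00-\u9fff])")


def separate_cjk(text):
    """Surround every CJK ideograph with spaces so it becomes its own token."""
    return CJK.sub(r" \1 ", text)


def tokenize(text):
    """Lowercase word tokens, computed after CJK characters are separated."""
    return [t.lower() for t in re.findall(r"\w+", separate_cjk(text))]


def phrase_match(query, text):
    """True if the query's tokens appear consecutively in the text's tokens."""
    q, t = tokenize(query), tokenize(text)
    if not q:
        return False
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


print(phrase_match("中文", "我好想要中文搜尋"))   # True
print(phrase_match("Tele", "Telegram search"))  # False, Latin words unchanged
```

With this approach every ideograph is indexed separately, so the index grows and single-character queries may return many results. A custom tokenizer could make the same split without modifying the stored text.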
Nonetheless, I can hardly understand how MessagesDb.cpp works. In particular, I don't know how Telegram performs search tasks or how search_id is generated.
So, how can we solve this problem? I would like to contribute to Telegram.
Thanks.