Searching does not work properly for CJK ideographs #1004

Open
nevikw39 opened this issue Apr 14, 2020 · 12 comments

Comments

@nevikw39

Hello,

Telegram's search works poorly for Chinese-Japanese-Korean (CJK) ideographs, which makes it hard to promote Telegram in Taiwan.

I tried to find the cause. I took a look at MessagesDb.cpp and found that Telegram uses SQLite to store messages and the FTS5 module to build a search table.

And that is the point. FTS5 splits a string into tokens and puts them into a hash table. Suppose the text is "Telegram search". Only "Telegram" and "search" would match it, whereas neither "Tele" nor "a" would return a result. Unfortunately, Chinese characters are all categorized as "Letter", so each one is treated as part of a token. Hence a whole Chinese text like "我好想要中文搜尋", which consists of consecutive Chinese characters without any delimiter, is viewed as a single token. That is, none of "想要", "中文" or "搜尋" would match it.

I have two ideas. The simple one is to insert an invisible separator such as '\a' between every pair of adjacent Chinese characters. The other is to implement a custom tokenizer.

Nonetheless, I can hardly understand how MessagesDb.cpp works. Actually, I don't know how Telegram performs search tasks or how search_id is generated.

So, how can we solve this problem? I would like to contribute to Telegram.

Thanks.
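
For readers who want to try the behavior described above, here is a minimal sketch of the first idea (inserting a separator between ideographs) using Python's standard `sqlite3` module. It is not TDLib's code: the helper names `is_cjk_ideograph`, `separate_cjk`, `to_phrase_query` and `demo`, and the table names, are hypothetical, and a plain space is used as the separator instead of '\a'. It assumes the local SQLite build has FTS5 enabled and uses FTS5's default tokenizer.

```python
import sqlite3


def is_cjk_ideograph(ch: str) -> bool:
    """Rough check against the main CJK ideograph blocks (not exhaustive)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
        or 0x20000 <= cp <= 0x2FA1F   # supplementary-plane ideograph blocks
    )


def separate_cjk(text: str) -> str:
    """Surround every CJK ideograph with spaces, then collapse whitespace."""
    padded = "".join(f" {ch} " if is_cjk_ideograph(ch) else ch for ch in text)
    return " ".join(padded.split())


def to_phrase_query(term: str) -> str:
    """Build an FTS5 phrase query so the separated characters must be adjacent."""
    return '"' + separate_cjk(term).replace('"', '""') + '"'


def demo(term: str = "中文") -> tuple[int, int]:
    """Return (hits without separators, hits with separators) for one message."""
    message = "我好想要中文搜尋"
    db = sqlite3.connect(":memory:")
    # Hypothetical tables: one indexes the raw text, the other the separated text.
    db.execute("CREATE VIRTUAL TABLE msgs_plain USING fts5(body)")
    db.execute("CREATE VIRTUAL TABLE msgs_spaced USING fts5(body)")
    db.execute("INSERT INTO msgs_plain(body) VALUES (?)", (message,))
    db.execute("INSERT INTO msgs_spaced(body) VALUES (?)", (separate_cjk(message),))
    plain = db.execute(
        "SELECT count(*) FROM msgs_plain WHERE msgs_plain MATCH ?",
        ('"' + term + '"',),
    ).fetchone()[0]
    spaced = db.execute(
        "SELECT count(*) FROM msgs_spaced WHERE msgs_spaced MATCH ?",
        (to_phrase_query(term),),
    ).fetchone()[0]
    db.close()
    return plain, spaced


if __name__ == "__main__":
    print(demo())
```

With the default tokenizer, `demo()` should return `(0, 1)`: the unseparated message is indexed as one token, so "中文" finds nothing, while the separated copy matches the phrase "中 文". The cost is that both the stored text and every query have to go through `separate_cjk`, and the original text must be kept elsewhere for display.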

@levlam
Contributor

levlam commented Apr 14, 2020

You have found the client-side search, which is enabled only for secret chat messages. The best way to improve it is to contribute directly to SQLite's FTS extension.
Search for messages in all other chats is done server-side, so there is no way to improve it on TDLib's side.

@nevikw39
Author

OK, I see.

So there is no way to look at Telegram's server-side code?

@levlam
Contributor

levlam commented Apr 15, 2020

No.

@kouhe3

kouhe3 commented Jun 13, 2020

Telegram's search is based on "words", and words are delimited by punctuation or spaces.
This is an English-oriented search method, which is very convenient for English. For example, "hello" can't be found by searching for "he"; the full word "hello" must be used. That fits English usage: when I want to find messages containing "he", I don't want to see messages containing "hello". But this approach is not convenient for Chinese and some other languages. Chinese is based on Chinese characters.

https://congcong0806.github.io/2019/11/04/TelegramSearch/

@nathancchu

Any updates on this? Not being able to search CJK characters effectively is a huge pain when using Telegram.

@tylvn

tylvn commented Apr 9, 2022

This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it, and I don't know why. Is it because Telegram users hardly use CJK languages, or is it technically hard to achieve?

@githubhjs

> This is a huge problem for people who use CJK languages, but Telegram doesn't seem to plan to solve it, and I don't know why. Is it because Telegram users hardly use CJK languages, or is it technically hard to achieve?

Actually, I have found lots of CJK users on Telegram. But the search issue is limiting how much that number can grow.

@devuterian

Still waiting for a fix... this is important.

@tdlib tdlib deleted a comment from Valentyna12 Feb 8, 2024
@tonytonyjan

I used to be eager to have this feature.

In the end, I switched to Discord with my friends.

@TokuiNico

This feature is important to me.

@Platway

Platway commented Nov 21, 2024

They think their language is the entire world. What a narrow prejudice!

@Valentyna12

Valentyna12 commented Dec 21, 2024 via email
