Do you know how to prevent Zalgo Text?

What is Zalgo Text?

Zalgo text is built mostly from characters in the Combining Diacritical Marks block (U+0300 to U+036F), whose UTF-8 encodings start with the bytes 0xCC and 0xCD. In Unicode, character rendering does not use a simple character-cell model in which each glyph fits into a box of a given height. Combining marks may be rendered above, below, or inside a base character. So you can easily build a character sequence, consisting of a base character followed by “combining above” marks, of any length, to reach any desired visual height, assuming that the rendering software conforms to the Unicode rendering model. Such a sequence has no meaning, of course, and producing one takes no skill (e.g., given a keyboard with a suitable driver). You can also mix “combining above” and “combining below” marks.
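To make the idea concrete, here is a minimal Python sketch that checks the UTF-8 lead bytes of the combining marks block and stacks marks on a base character. The helper name stack_marks and the two marks it uses (U+0301 and U+0323) are choices made for this illustration only.

```python
import unicodedata

# Every code point in the Combining Diacritical Marks block (U+0300 to U+036F)
# is encoded in UTF-8 with a lead byte of 0xCC or 0xCD.
lead_bytes = {chr(cp).encode("utf-8")[0] for cp in range(0x0300, 0x0370)}
print(sorted(hex(b) for b in lead_bytes))  # ['0xcc', '0xcd']


def stack_marks(base: str, above: int, below: int) -> str:
    """Return `base` followed by `above` marks over it and `below` marks under it.

    Hypothetical helper for this example: it uses U+0301 COMBINING ACUTE ACCENT
    for the marks above and U+0323 COMBINING DOT BELOW for the marks below.
    """
    return base + "\u0301" * above + "\u0323" * below


sample = stack_marks("a", above=5, below=3)
print(len(sample))  # 9 code points, but a single visual "column"

# Canonical combining class: 0 for the base, 230 for "above", 220 for "below".
print([unicodedata.combining(ch) for ch in sample])
```

How tall the result looks depends entirely on the font and the renderer, so the same string can look harmless in one program and overflow several lines in another.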

How can Zalgo Text be prevented?

To prevent Zalgo text, the following steps can be taken:

  1. Split the incoming text into smaller units (words or sentences);

  2. Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);

  3. Train a machine learning algorithm to judge if it looks too "dark" and "busy";

  4. If the algorithm's confidence is low, defer to human moderators.
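The control flow of these four steps could be sketched roughly as follows. This is only an outline: the renderer and the classifier are passed in as callables because the post does not say which font renderer or model to use, and the names moderate, Renderer, Classifier and CONFIDENCE_FLOOR (with its value of 0.8) are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: a renderer turns a text unit into image bytes, and
# a classifier returns (looks_like_zalgo, confidence between 0.0 and 1.0).
Renderer = Callable[[str], bytes]
Classifier = Callable[[bytes], Tuple[bool, float]]

CONFIDENCE_FLOOR = 0.8  # assumed cut-off; would need tuning on real data


def moderate(text: str, render: Renderer, classify: Classifier) -> List[Tuple[str, str]]:
    """Return a (unit, decision) pair for each whitespace-separated unit."""
    decisions = []
    for unit in text.split():                    # step 1: split into units
        image = render(unit)                     # step 2: render on the server
        is_zalgo, confidence = classify(image)   # step 3: model's judgement
        if confidence < CONFIDENCE_FLOOR:        # step 4: defer when unsure
            decisions.append((unit, "human_review"))
        else:
            decisions.append((unit, "reject" if is_zalgo else "accept"))
    return decisions
```

Splitting on whitespace is the simplest choice here; splitting into sentences instead would mean fewer renders but a coarser decision per unit.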

Finally, if you're looking to detect Zalgo text rather than unconditionally remove it, you could perform character frequency analysis on each line of the input. A function, is_zalgo, calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters in the word). It then checks whether the third quartile of the words' scores is greater than THRESHOLD. With THRESHOLD equal to 0.5, this means we're trying to detect whether roughly one out of every four words is more than 50% Zalgo characters. (The THRESHOLD of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm probably gives the best payoff for the coding effort.
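A minimal sketch of that approach, using only the Python standard library, might look like the code below. Two things here are assumptions rather than part of the description above: "potential Zalgo characters" are taken to be the Unicode general categories Mn (nonspacing mark) and Me (enclosing mark), and the third quartile is computed with linear interpolation between the closest ranks. The helpers word_score and third_quartile are names invented for this example.

```python
import unicodedata

# Assumption: nonspacing (Mn) and enclosing (Me) marks count as potential
# Zalgo characters.
ZALGO_CATEGORIES = {"Mn", "Me"}
THRESHOLD = 0.5  # guessed value; adjust for real-world input


def word_score(word: str) -> float:
    """Fraction of the word's code points that are potential Zalgo characters."""
    marks = sum(unicodedata.category(ch) in ZALGO_CATEGORIES for ch in word)
    return marks / len(word)


def third_quartile(values):
    """75th percentile, linearly interpolated between the closest ranks."""
    ordered = sorted(values)
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def is_zalgo(text: str) -> bool:
    """True if the third quartile of the per-word scores exceeds THRESHOLD."""
    words = text.split()
    if not words:
        return False
    return third_quartile([word_score(w) for w in words]) > THRESHOLD


# Three combining marks after every letter give a word score of 0.75.
noisy = "".join(ch + "\u0301" * 3 for ch in "hello")
print(is_zalgo("hello world"))    # False
print(is_zalgo(noisy))            # True
```

To process a file, call is_zalgo on each line in turn. Note that legitimate text in scripts that rely heavily on combining marks can also score high, which is one more reason to treat THRESHOLD as a starting point.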