Sublime Forum

Sublime Text 4180 memory leak (and high usage)

#1

platform: windows 7 x64
sublime text version: sublime_text_build_4180_x64

This issue may be related to the syntax engine (tested with JavaScript files).

  • file a.js: simple file with content generated by default ‘for’ snippet:

for (var i = Things.length - 1; i >= 0; i--) {
	Things[i]
}

  • file main.js: comes from the cpptools extension of Visual Studio Code; you can find it as “\cpptools\dist\src\main.js”; the file size is ~5.39 MB (it can be extracted from cpptools-windows-x64.vsix);
  • file test.js: repeats the content of a.js until it matches the size of main.js.

To reproduce:

  1. download https://www.sublimetext.com/download_thanks?target=win-x64-portable;
  2. launch sublime_text.exe once to let it initialize (e.g. create package caches);
  3. close and re-launch Sublime Text, and use taskmgr.exe to record its commit memory usage; this is ~22.5 MB on my laptop;
  4. then open a.js and record the commit memory usage again: ~25.8 MB;
  5. then open test.js; the memory usage rises to ~54 MB;
  6. close test.js; the memory usage drops back to ~24.4 MB.

Then repeat with main.js:

  1. close and re-launch Sublime Text, and record the commit memory usage of the fresh instance: ~23 MB;
  2. then open a.js; the memory usage becomes ~23.6 MB;
  3. then open main.js; the memory usage becomes ~124.6 MB;
  4. finally close main.js; the memory usage remains at ~122.3 MB.

The expected behavior:
a) the memory usage should return to <25 MB after closing main.js;
b) the memory usage for syntax-highlighting main.js should not be so much higher than for test.js.

0 Likes

#2
  1. In many cases it is up to the operating system to free used heap memory. The memory might already have been freed by ST, but the heap chunk was not yet reclaimed by Windows, which does so only on certain triggers, such as out-of-memory situations.

  2. A memory leak would mean ST’s memory footprint grows constantly when opening main.js again and again. What actually happens is that it moves from 145 MB to 174 MB and back to 145 MB whenever main.js is opened and closed again.

    This rather indicates that ST allocated a piece of memory for something it finds useful to keep available.

  3. The provided example compares apples with oranges. main.js contains hundreds or even thousands of functions and other entities, which are added to the symbol list and/or index, while thousands of simple for loops are not.

    And that’s probably already the answer to why the memory footprint stays at 145 MB. It is most likely just the RAM required to hold ST’s local symbols and global symbol index.

1 Like

#3

The index isn’t what’s using memory here; it’s simply the syntax engine’s cache. But yeah, this isn’t a memory leak.

0 Likes

#4

@deathaxe thank you for your detailed investigation of this post and for the reasonable answers. I marked it as a memory leak because I use Sublime Text for daily work: it uses ~40 MB when started in the morning but sometimes ends up reaching ~800 MB by night. I didn’t see why ST’s memory usage kept increasing so much, and one day I happened to find that main.js can cause the behaviors described above.

It is good to see that ST is not “leaking” more and more RAM when closing and reopening main.js. But I still have some questions.

  1. From the recorded behavior we see that ST freed the memory for test.js and the OS reclaimed it quickly. So can we conclude that ST keeps some ‘reusable’ data for main.js, and that it is not an OS cache?
  2. What is the benefit of ST keeping that ‘reusable’ data? At least I didn’t see a noticeable speed boost when reopening the same main.js (5.4 MB).
  3. You mentioned “symbol list and/or index”, but how can the list/index cost 100 MB or so for the 5.4 MB main.js? It looks unreasonable to me.
  4. And I don’t understand why ST still holds the symbols/index after closing main.js. Does it give any speed boost?

However, it would be reasonable if it worked like this:

static std::vector<Type> data;         // global 'reusable' data
for (;;) data.push_back(something);    // when opening main.js
data.clear();                          // at closing main.js
assert(data.capacity() > 0);           // the capacity stays high; that's why the memory footprint stays

If this is true, I would rather consider it a ‘leak’. ST should release the high capacity of those vectors.

0 Likes

#5

ST caches a bunch of data related to syntax highlighting; this is for performance. As I said earlier, the memory usage doesn’t come from the index but from this cache. That’s also why it doesn’t get freed: this data is kept until ST is restarted.

0 Likes

#6

That’s not a very helpful cache; it doesn’t help to speed up reopening main.js.
So it is now clear: by ‘leak’ I meant a bunch of less helpful caches.
And it still looks strange to me that the cache costs 100 MB for JavaScript. Does it grow more and more as ST sees more and more “syntax patterns”?
I believe it could be greatly improved.

0 Likes

#7

That’s not a very helpful cache; it doesn’t help to speed up reopening main.js.

It does, it just also helps the first time around.

Does it grow more and more as ST sees more and more “syntax patterns”?

It increases the more unique execution paths through the syntax are taken. As you’ve already seen, it doesn’t increase when reopening the same file.
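This behavior can be sketched with a toy model (purely illustrative; `lexer_cache`, `compute_transition`, and the keying scheme are hypothetical, not ST’s actual internals): the cache is keyed by the construct being lexed, so a file full of repeated constructs adds only a few entries, and reopening it adds none.

```python
# Toy model of a cache keyed by "execution path" (hypothetical names,
# not ST's real internals): each unique construct costs one entry;
# revisiting the same construct -- or reopening the same file -- adds nothing.
lexer_cache = {}

def compute_transition(path_key):
    # stand-in for the expensive part (regex compilation, state building)
    return hash(path_key)

def lex(path_key):
    if path_key not in lexer_cache:
        lexer_cache[path_key] = compute_transition(path_key)
    return lexer_cache[path_key]

def highlight(tokens):
    for tok in tokens:
        lex(tok)

file_a = ["for", "(", "var", ")"] * 1000   # the same constructs repeated
highlight(file_a)
size_first = len(lexer_cache)              # grows per unique construct only
highlight(file_a)                          # "reopen" the same file
size_second = len(lexer_cache)
assert size_first == size_second == 4      # no growth on reopen
```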

0 Likes

#8

I pasted the code below (Python 3.3) into the console:

import sublime_plugin
from time import time
start_time,remain_files=0,0
class _MeasureLoadTime(sublime_plugin.ViewEventListener):
  def on_reload(self): self.on_load()
  def on_revert(self): self.on_load()
  def on_load(self):
    global start_time,remain_files
    remain_files-=1
    if not remain_files:
      print('it takes {:.3f} seconds'.format(time()-start_time))

def reload_test(files):
  global start_time,remain_files
  start_time,remain_files = time(),len(files)
  for file in files:
    view = window.open_file(file)
    if view.is_loading():
      sublime_plugin.create_view_event_listeners([_MeasureLoadTime], view)
    else:
      view.run_command('revert') #reload

Then I repeated the test command: reload_test(["R:/main.js"])
It reports ~3.0 seconds to load main.js for the first time in a session.
It reports ~3.0 seconds for the following loads/reverts as well, apparently ignoring the ‘cache’.

0 Likes

#9

Not sure what you’re trying to say? As I said the cache is also useful on the first load.

0 Likes

#10

Well, from a user’s perspective I think the caching behavior of ST’s syntax highlighting is less than reasonable, and it causes unexpected memory usage once an ST session has visited a lot of source.

  1. tests show that the cache certainly does not make a second load faster than the first;
  2. how can it be called a ‘cache’ if it doesn’t help? I guess (and it behaves like) it is the implementation of the syntax engine that simply needs such memory to store some information;
  3. so it would be better to free the memory after closing files;
  4. if it really acts as a cache, then apparently the cache hit rate is too low to give any boost; it would be better to discard some ‘cold cache’;
  5. finally, the syntax engine is not fast for files of a few megabytes; the ‘cache’ is a good idea, but unfortunately the engine has to be improved a bit further to utilize it.
0 Likes

#11

For most serious devs/users it is common to have hundreds of files with various different syntaxes open at a time, working on them all day long. So in an everyday workload it is totally unimportant to eagerly unload syntax definitions each time files are closed, as they are very likely still in use or will be used again soon.

I don’t have any insights into ST’s sources - nor do you - but maybe you have a wrong understanding of ‘cache’. Actually, ST compiles sublime-syntax files into byte code, which is stored in “cache” files. As such, what is kept in RAM is most likely just that byte code of regular expressions and scope names, required to do syntax highlighting. ST loads syntax contexts lazily, on demand, hence the footprint depends on the parsed content. It is probably not reasonable to force unloading those contexts just to load them again and again. Even if your CPU is fast enough for you not to see a significant difference in speed, it causes useless extra cycles consuming power.

Also note that Windows keeps all sorts of resources in RAM, once they have been read from disk.

If Windows runs out of memory it will reclaim RAM from ST and force it to reload resources on demand. Until then - if enough RAM is available - why not use it?

I like those external experts!

0 Likes

#12

As bschaaf said earlier, it’s nothing about an OS cache. And to clarify, there are three ‘cache’ types mentioned in this thread: 1 - the OS cache (unrelated to this issue); 2 - the static cache of syntax definitions, compiled into bytecode and stored in “Cache/{SyntaxName}/*”; 3 - the dynamic ‘cache’ bound to the different ‘unique execution paths’ as the syntax engine runs over sources.

My point is that, based on the current observations, the dynamic ‘cache’ is nearly unique to each file, and is of little help for any other file, or even for reloading the same file.

Maybe the syntax engine team has the final word.

0 Likes

#13

To be frank: you are not in a position to make any claims as to the usefulness of this cache. You have no understanding of the internals of our syntax engine, nor do you have the tools needed to measure the effectiveness of any caching we do.

I’ve said it twice already: the data we cache is already used during a first load. When you first load a file you’re already using this cache, so comparing load times is almost entirely pointless. Although much more complicated, you can think of it like memoization.
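The memoization analogy can be made concrete with a short sketch (illustrative only; `lex_state` is a hypothetical stand-in, not ST’s API): the cache pays off during the very first load whenever a sub-computation repeats, which is why comparing first and second load times measures nothing.

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=None)
def lex_state(state):
    # hypothetical stand-in for an expensive per-state computation
    global calls
    calls += 1
    return state * 2

# "first load": a file whose lines hit the same few states over and over
first_file = [1, 2, 3] * 10000
for s in first_file:
    lex_state(s)

# only 3 expensive computations for 30000 lookups -- the cache already
# paid for itself during the first load, before any reload happened
assert calls == 3
```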

Yes, there is some data in the cache(s) that could be freed after loading. But tracking what can be freed is slow (a tracing GC is what we’d likely use), and this data will in all likelihood be used in the future anyway, so there’s little reason to free it.

Maybe the syntax engine team has the final word.

FWIW I am the syntax engine team.

@deathaxe you’re mostly on the right track, though only embeds are loaded lazily. This cache is comprised of scope names and full-scale lexer states. A good chunk of the allocated memory is also just the tokens and after closing the file there’s a lot of memory that’s free but won’t be released to the OS.

1 Like

#14

I don’t think taskmgr and my reload_test(["R:/main.js"]) told me a fake story.

What I can conclude is:

  1. the syntax engine still holds the ‘cache’ because it assumes the data may still be used/referenced by some other open files, and because of the lack of tracking;
  2. it is not actually a ‘cache’; it is more like ‘identical strings share one single piece of data’;
  3. it just avoids duplicated allocations, which is why it does not speed up any further reloads;
  4. ST implements GC as ‘close the application and restart’, letting the OS collect all garbage.
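Point 2 describes what is usually called string interning; Python’s `sys.intern` shows the same idea in miniature (a generic illustration, not a claim about ST’s internals):

```python
import sys

# two scope-name strings built independently at runtime
a = ".".join(["meta", "function", "js"])
b = ".".join(["meta", "function", "js"])
assert a == b and a is not b    # equal content, two separate allocations

# interning collapses them into one shared object
a = sys.intern(a)
b = sys.intern(b)
assert a is b                   # one allocation, shared by every user
```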

In addition to the ‘garbage collection issue’ and the ‘loading performance issue’, the cache size (in bytes) is also relatively large; maybe it could be optimized by something like Huffman encoding. I think the syntax engine team will sort everything out eventually.

0 Likes

#15

the syntax engine still holds the ‘cache’ because it assumes the data may still be used/referenced by some other open files, and because of the lack of tracking;

Large amounts of the data in the cache are actively referenced.

it is not actually a ‘cache’; it is more like ‘identical strings share one single piece of data’;

That’s how we store tokens and scopes. Lexer states are not like that, and lexer states are the only thing with references that could be cleaned up.

it just avoids duplicated allocations, which is why it does not speed up any further reloads;

See previous.

ST implements GC as ‘close the application and restart’, letting the OS collect all garbage.

For this cache, yes that’s our current strategy.

maybe it could be optimized by something like Huffman encoding

Optimizing for memory usage by using compression also results in worse performance.

FWIW I agree we certainly can do better here, but your suggestions are unhelpful. You do not know the internals of the syntax engine.

0 Likes

#16

You may have misunderstood what I mean by ‘encoding’.

In Windows, it is easy to dump newly allocated memory by taking two snapshots of the process memory space. I dumped the ‘cache’ data for main.js; it is about 115 MB. This data includes a representation of the source text.

Then all scope names can be found by extracting the strings that start with 'source.js '; there are 49811 scope names in total.

source.js meta.group.js meta.function.js meta.block.js meta.group.js meta.function.js meta.block.js meta.function.js meta.block.js meta.function-call.arguments.js meta.group.js meta.function-call.arguments.js meta.group.js
source.js meta.group.js meta.function.js meta.block.js meta.group.js meta.mapping.js meta.group.js meta.function.js meta.block.js meta.function.js meta.block.js meta.function.js meta.block.js meta.conditional.js meta.block.js meta.conditional.js meta.brackets.js punctuation.section.brackets.end.js
… (49809 more)

All such strings sum up to ~17 MB (17,817,024 bytes, including a ‘\0’ terminator for each name).
We could tokenize the scope names (when parsing the .sublime-syntax file) and build a token table: 0 = source.js, 1 = meta.group.js, and so on; there are 160 such tokens,
so a single byte is enough to represent a token. This would reduce all the strings from ~17 MB to ~1 MB, saving 16 MB of the cache data (115 MB).

Also, since the scope names are used for matching in several contexts (e.g. coloring, key bindings, bracket matching, indentation rules), it is fast to compose a literal scope name from the tokenized representation. Scope matching also needs to tokenize the full name anyway (at least it has to determine the token boundaries), so it helps if the names are already tokenized.

This is just a simple thought based on observations from outside the internal implementation.
And surely the engine team can make more practical optimizations, such as structural ones.

0 Likes