## TODO List

+ - [x] Support liger kernels to save GPU memory
+ - [x] Release the code, model, and data of MPO
+ - [x] Support multimodal packed dataset
- [ ] Support vLLM and Ollama
- - [x] Rebuild documents using readthedocs
- - [x] Support fine-tuning different LLMs with LoRA
- [ ] Support video and PDF input in online demo
- [ ] Release InternVL2 with VisionLLMv2 integration
+ - [x] Rebuild documents using readthedocs
+ - [x] Support fine-tuning different LLMs with LoRA
- [x] Release `requirements.txt` for InternVL2
- [x] Release training / evaluation code for InternVL2 series
- [x] Release Streamlit web UI for InternVL1.5 and InternVL2
@@ -295,14 +298,14 @@ We welcome everyone to use our API for research. For better management, please s

ViT-22B uses the private JFT-3B dataset.

- | method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
- | ------------------- | :----: | :---: | :-----: | :---: | :---: | :---: | :-------: |
- | OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
- | DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
- | EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
- | MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
- | ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
- | InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |
+ | method | #param | IN-1K | IN-ReaL | IN-V2 | IN-A | IN-R | IN-Sketch |
+ | ------------------- | :----: | :---: | :-----: | :---: | :--: | :--: | :-------: |
+ | OpenCLIP-G | 1.8B | 86.2 | 89.4 | 77.2 | 63.8 | 87.8 | 66.4 |
+ | DINOv2-g | 1.1B | 86.5 | 89.6 | 78.4 | 75.9 | 78.8 | 62.5 |
+ | EVA-01-CLIP-g | 1.1B | 86.5 | 89.3 | 77.4 | 70.5 | 87.7 | 63.1 |
+ | MAWS-ViT-6.5B | 6.5B | 87.8 | - | - | - | - | - |
+ | ViT-22B\* | 21.7B | 89.5 | 90.9 | 83.2 | 83.8 | 87.4 | - |
+ | InternViT-6B (ours) | 5.9B | 88.2 | 90.4 | 79.9 | 77.5 | 89.8 | 69.1 |

- Semantic Segmentation [\[see details\]](./segmentation#-evaluation)

@@ -318,12 +321,12 @@ We welcome everyone to use our API for research. For better management, please s

- Zero-Shot Image Classification [\[see details\]](./clip_benchmark#imagenet-variants-and-objectnet)

- | method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
- | ----------------- | :---: | :---: | :---: | :---: | :-------: | :-------: |
- | OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
- | EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
- | ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
- | InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |
+ | method | IN-1K | IN-A | IN-R | IN-V2 | IN-Sketch | ObjectNet |
+ | ----------------- | :---: | :--: | :--: | :---: | :-------: | :-------: |
+ | OpenCLIP-G | 80.1 | 69.3 | 92.1 | 73.6 | 68.9 | 73.0 |
+ | EVA-02-CLIP-E+ | 82.0 | 82.1 | 94.5 | 75.7 | 71.6 | 79.6 |
+ | ViT-22B\* | 85.9 | 90.1 | 96.0 | 80.9 | - | 87.6 |
+ | InternVL-C (ours) | 83.2 | 83.8 | 95.5 | 77.3 | 73.9 | 80.6 |

- Multilingual Zero-Shot Image Classification [\[see details\]](./clip_benchmark#multilingual-imagenet-1k)

@@ -341,13 +344,13 @@ We welcome everyone to use our API for research. For better management, please s

- Zero-Shot Video Classification

- | method | #frame | K400 | K600 | K700 |
- | ----------------- | :----: | :---: | :---: | :---: |
- | OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
- | EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
- | InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
- | ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
- | InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |
+ | method | #frame | K400 | K600 | K700 |
+ | ----------------- | :----: | :--: | :--: | :--: |
+ | OpenCLIP-G | 1 | 65.9 | 66.1 | 59.2 |
+ | EVA-02-CLIP-E+ | 1 | 69.8 | 69.3 | 63.4 |
+ | InternVL-C (ours) | 1 | 71.0 | 71.3 | 65.7 |
+ | ViCLIP | 8 | 75.7 | 73.5 | 66.4 |
+ | InternVL-C (ours) | 8 | 79.4 | 78.8 | 71.5 |

</details>

@@ -570,12 +573,12 @@ We welcome everyone to use our API for research. For better management, please s

- Multilingual Zero-Shot Image-Text Retrieval on XTD [\[see details\]](./clip_benchmark#xtd)

- | method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
- | ----------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |
- | AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
- | OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
- | InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
- | InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |
+ | method | EN | ES | FR | ZH | IT | KO | RU | JP | average |
+ | ----------------- | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :-----: |
+ | AltCLIP | 95.4 | 94.1 | 92.9 | 95.1 | 94.2 | 94.4 | 91.8 | 91.7 | 93.7 |
+ | OpenCLIP-XLM-R-H | 97.3 | 96.1 | 94.5 | 94.7 | 96.0 | 90.2 | 93.9 | 94.0 | 94.6 |
+ | InternVL-C (ours) | 97.3 | 95.7 | 95.1 | 95.6 | 96.0 | 92.2 | 93.3 | 95.5 | 95.1 |
+ | InternVL-G (ours) | 98.6 | 97.7 | 96.5 | 96.7 | 96.9 | 95.1 | 94.8 | 96.1 | 96.6 |

</details>