[1] A ConvNet for the 2020s[EB/OL].(2022-03-02)[2024-05-20].https://arxiv.org/abs/2201.03545.
[2] InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks[EB/OL].(2024-01-15)[2024-05-20].https://arxiv.org/abs/2312.14238.
[3] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites[EB/OL].(2024-04-29)[2024-06-03].https://arxiv.org/abs/2404.16821.
[4] Visual Instruction Tuning[EB/OL].(2023-12-11)[2024-05-06].https://arxiv.org/abs/2304.08485.
[5] Improved Baselines with Visual Instruction Tuning[EB/OL].(2024-05-15)[2024-06-05].https://arxiv.org/abs/2310.03744.
[6] Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models[EB/OL].(2024-03-27)[2024-05-29].https://arxiv.org/abs/2403.18814.
[7] A Survey on Multimodal Large Language Models[EB/OL].(2024-04-01)[2024-05-21].https://arxiv.org/abs/2306.13549.
[8] A Survey of Large Language Models[EB/OL].(2023-03-31)[2024-06-06].https://arxiv.org/abs/2303.18223.
[9] Attention Is All You Need[EB/OL].(2023-08-02)[2024-05-23].https://arxiv.org/abs/1706.03762.
[10] GPT-4 Technical Report[EB/OL].(2023-03-15)[2024-05-10].https://arxiv.org/abs/2303.08774.