Understanding DeepSeek R1

DeepSeek-R1 is an open-source language model built on DeepSeek-V3-Base that's been making waves in the AI community. Not only does it match, or even surpass, OpenAI's o1 model in many benchmarks, it also comes with fully MIT-licensed weights. This marks it as the first non-OpenAI/Google model to deliver strong reasoning capabilities in an open and accessible way.

What makes DeepSeek-R1 particularly exciting is its transparency. Unlike the less-open approaches from some industry leaders, DeepSeek has published a detailed training methodology in their paper.
The model is also remarkably cheap, with input tokens costing just $0.14-0.55 per million (vs o1's $15) and output tokens at $2.19 per million (vs o1's $60).

Until ~GPT-4, the common wisdom was that better models required more data and compute. While that's still valid, models like o1 and R1 demonstrate an alternative: inference-time scaling through reasoning.

The Basics

The DeepSeek-R1 paper presented multiple models, but the main ones among them were R1 and R1-Zero. Following these are a series of distilled models that, while interesting, I won't discuss here.

DeepSeek-R1 uses two major ideas:

1. A multi-stage pipeline where a small set of cold-start data kickstarts the model, followed by large-scale RL.
2. Group Relative Policy Optimization (GRPO), a reinforcement learning method that relies on comparing multiple model outputs per prompt to avoid the need for a separate critic.

R1 and R1-Zero are both reasoning models. This essentially means they do Chain-of-Thought before answering. For the R1 series of models, this takes the form of thinking within a tag, before answering with a final summary.
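
As a minimal sketch of what this looks like in practice (the `<think>` tag name is an assumption based on how R1-style outputs are commonly rendered, not something taken from the paper), a response can be split into its reasoning trace and final summary like this:

```python
import re

def split_reasoning(response: str) -> tuple[str, str]:
    """Split an R1-style response into (reasoning, answer).

    Assumes the reasoning is wrapped in <think>...</think> tags,
    with the final summary following the closing tag.
    """
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if match is None:
        return "", response.strip()          # no reasoning block found
    reasoning = match.group(1).strip()
    answer = response[match.end():].strip()  # everything after </think>
    return reasoning, answer

example = "<think>2 + 2 is 4, minus 1 is 3.</think>The answer is 3."
print(split_reasoning(example))
```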

R1-Zero vs R1

R1-Zero applies Reinforcement Learning (RL) directly to DeepSeek-V3-Base with no supervised fine-tuning (SFT). RL is used to optimize the model's policy to maximize reward.
R1-Zero achieves excellent accuracy but sometimes produces confusing outputs, such as mixing multiple languages in a single response. R1 fixes that by incorporating limited supervised fine-tuning and multiple RL passes, which improves both accuracy and readability.

It is interesting how some languages may express certain concepts better, which leads the model to pick the most expressive language for the task.

Training Pipeline

The training pipeline that DeepSeek published in the R1 paper is immensely interesting. It showcases how they developed such strong reasoning models, and what you can expect from each stage. This includes the problems that the resulting models from each stage have, and how they solved them in the next stage.

It's interesting that their training pipeline deviates from the usual one:

The usual training approach: Pretraining on a big dataset (train to predict the next word) to get the base model → supervised fine-tuning → preference tuning via RLHF
R1-Zero: Pretrained → RL
R1: Pretrained → Multistage training pipeline with multiple SFT and RL stages

Cold-Start Fine-Tuning: Fine-tune DeepSeek-V3-Base on a few thousand Chain-of-Thought (CoT) samples to make sure the RL process has a decent starting point. This gives a good model to start RL from.
First RL Stage: Apply GRPO with rule-based rewards to improve reasoning correctness and formatting (such as forcing chain-of-thought into thinking tags). When they were near convergence in the RL process, they moved to the next step. The result of this step is a strong reasoning model, but with weak general capabilities, e.g., poor formatting and language mixing.
Rejection Sampling + general data: Create new SFT data through rejection sampling on the RL checkpoint (from step 2), combined with supervised data from the DeepSeek-V3-Base model. They collected around 600k high-quality reasoning samples.
Second Fine-Tuning: Fine-tune DeepSeek-V3-Base again on 800k total samples (600k reasoning + 200k general tasks) for broader capabilities. This step resulted in a strong reasoning model with general capabilities.
Second RL Stage: Add more reward signals (helpfulness, harmlessness) to refine the final model, in addition to the reasoning rewards. The result is DeepSeek-R1.
They also did model distillation for several Qwen and Llama models on the reasoning traces to get distilled-R1 models.

Model distillation is a technique where you use a teacher model to improve a student model by generating training data for the student model.
The teacher is typically a larger model than the student.
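
As a rough sketch of the idea (placeholder functions and data, not DeepSeek's actual pipeline): the teacher samples reasoning traces, a simple filter keeps the good ones, and the surviving pairs become SFT data for the smaller student model.

```python
def teacher_generate(prompt: str) -> str:
    # Placeholder for sampling from the large teacher model (e.g. R1).
    return f"<think>working through: {prompt}</think>42"

def is_correct(response: str, reference: str) -> bool:
    # Rule-based filter; in practice this checks the final answer.
    return response.strip().endswith(reference)

prompts = [("What is 6 * 7?", "42")]
sft_data = []
for prompt, reference in prompts:
    for _ in range(4):                       # several samples per prompt
        response = teacher_generate(prompt)
        if is_correct(response, reference):  # keep only verified traces
            sft_data.append({"prompt": prompt, "completion": response})
            break

print(len(sft_data), "distillation examples collected")
# The student model is then fine-tuned (plain SFT) on sft_data.
```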

Group Relative Policy Optimization (GRPO)

The basic idea behind using reinforcement learning for LLMs is to fine-tune the model's policy so that it naturally produces more accurate and useful answers.
They used a reward system that checks not only for correctness but also for proper formatting and language consistency, so the model gradually learns to favor responses that meet these quality criteria.

In this paper, they encourage the R1 model to generate chain-of-thought reasoning through RL training with GRPO.
Rather than adding a separate module at inference time, the training process itself nudges the model to produce detailed, step-by-step outputs, making the chain-of-thought an emergent behavior of the optimized policy.

What makes their approach particularly interesting is its reliance on straightforward, rule-based reward functions.
Instead of depending on expensive external models or human-graded examples as in traditional RLHF, the RL used for R1 uses simple criteria: it may give a higher reward if the answer is correct, if it follows the expected formatting, and if the language of the answer matches that of the prompt.
Not relying on a reward model also means you don't have to spend time and effort training it, and it doesn't take memory and compute away from your main model.
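
As a minimal sketch of what such rule-based rewards could look like (the tag names, component weights, and exact-match check are illustrative assumptions, not DeepSeek's published reward code):

```python
import re

def rule_based_reward(prompt_lang: str, response: str, reference: str) -> float:
    """Toy reward combining correctness, formatting, and language consistency."""
    reward = 0.0

    # Formatting: reasoning wrapped in <think>...</think> before the answer.
    if re.search(r"<think>.*?</think>", response, re.DOTALL):
        reward += 0.5

    # Correctness: the text after the reasoning block matches the reference.
    final = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()
    if final == reference:
        reward += 1.0

    # Language consistency: crude check that a reply to an English prompt
    # stays ASCII (a real implementation would use a language identifier).
    if prompt_lang == "en" and all(ord(c) < 128 for c in response):
        reward += 0.25

    return reward

print(rule_based_reward("en", "<think>3 * 4 = 12</think>12", "12"))  # 1.75
```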

GRPO was introduced in the DeepSeekMath paper. Here's how GRPO works:

1. For each input prompt, the model generates a group of different responses.
2. Each response gets a scalar reward based on factors like correctness, formatting, and language consistency.
3. Rewards are adjusted relative to the group's performance, essentially measuring how much better each response is compared to the others (see the sketch after this list).
4. The model updates its policy slightly to favor responses with higher relative rewards. It only makes small adjustments, using techniques like clipping and a KL penalty, to make sure the policy doesn't drift too far from its original behavior.
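
A minimal sketch of the group-relative step, assuming the common formulation in which each reward is normalized by its group's mean and standard deviation (an illustration, not DeepSeek's training code):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each reward against its own group (one group per prompt).

    Responses that beat the group average get positive advantages and are
    reinforced; below-average responses get negative advantages.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # no learning signal if all rewards are equal
    return [(r - mu) / sigma for r in rewards]

# Four sampled responses to the same prompt, scored by rule-based rewards.
print(group_relative_advantages([1.75, 0.5, 1.75, 0.0]))
```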

A cool aspect of GRPO is its flexibility. You can use simple rule-based reward functions, for instance, awarding a bonus when the model correctly uses the expected syntax, to guide the training.

While DeepSeek used GRPO, you could use alternative methods instead (PPO or PRIME).

For those looking to dive deeper, Will Brown has written quite a nice implementation of training an LLM with RL using GRPO. GRPO has also already been added to the Transformer Reinforcement Learning (TRL) library, which is another good resource.
Finally, Yannic Kilcher has a great video explaining GRPO by going through the DeepSeekMath paper.

Is RL on LLMs the path to AGI?

As a final note on explaining DeepSeek-R1 and the methodologies they've presented in their paper, I want to highlight a passage from the DeepSeekMath paper, based on a point Yannic Kilcher made in his video.

These findings indicate that RL enhances the model's overall performance by rendering the output distribution more robust, in other words, it seems that the improvement is attributed to boosting the correct response from TopK rather than the enhancement of fundamental capabilities.

In other words, RL fine-tuning tends to shape the output distribution so that the highest-probability outputs are more likely to be correct, even though the overall capability (as measured by the number of correct answers) is largely already present in the pretrained model.

This suggests that reinforcement learning on LLMs is more about refining and "shaping" the existing distribution of responses rather than endowing the model with entirely new capabilities.
Consequently, while RL techniques such as PPO and GRPO can produce substantial performance gains, there appears to be a fundamental ceiling determined by the underlying model's pretrained knowledge.
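
To make the TopK point concrete, here is a toy illustration (my own framing with made-up samples, not data from the paper): given several sampled answers per question, what RL mainly improves is whether the dominant answer is correct (maj@k), while whether any sampled answer is correct (pass@k) was often already true for the base model.

```python
from collections import Counter

def pass_at_k(samples: list[str], reference: str) -> bool:
    """At least one of the k sampled answers is correct."""
    return reference in samples

def maj_at_k(samples: list[str], reference: str) -> bool:
    """The most frequent (majority-vote) answer is correct."""
    top_answer, _ = Counter(samples).most_common(1)[0]
    return top_answer == reference

# Hypothetical samples for the same question (correct answer: "12").
before_rl = ["13", "12", "13", "14", "15"]   # correct answer present, but not dominant
after_rl  = ["12", "12", "13", "12", "12"]   # distribution sharpened toward "12"

for name, samples in [("before RL", before_rl), ("after RL", after_rl)]:
    print(name, "pass@5:", pass_at_k(samples, "12"), "maj@5:", maj_at_k(samples, "12"))
```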

It is unclear to me how far RL will take us. Perhaps it will be the stepping stone to the next big milestone. I'm excited to see how it unfolds!

Running DeepSeek-R1

I have used DeepSeek-R1 via the official chat interface for various problems, which it seems to solve well enough. The additional search functionality makes it even nicer to use.

Interestingly, o3-mini(-high) was released as I was writing this post. From my initial testing, R1 seems stronger at math than o3-mini.

I also rented a single H100 via Lambda Labs for $2/h (26 CPU cores, 214.7 GB RAM, 1.1 TB SSD) to run some experiments.
The main goal was to see how the model would perform when deployed on a single H100 GPU, not to extensively test the model's abilities.

671B via Llama.cpp

DeepSeek-R1 1.58-bit (UD-IQ1_S) quantized model by Unsloth, with a 4-bit quantized KV-cache and partial GPU offloading (29 layers running on the GPU), running via llama.cpp:

29 layers seemed to be the sweet spot given this configuration.
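
For reference, a configuration along these lines can be sketched with the llama-cpp-python bindings (the model path is a placeholder, and the 4-bit KV-cache is configured separately via llama.cpp's cache-type options, which I omit here; treat this as an approximation rather than the exact setup used):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Load the 1.58-bit GGUF with 29 layers offloaded to the GPU.
llm = Llama(
    model_path="DeepSeek-R1-UD-IQ1_S.gguf",  # placeholder path to the Unsloth quant
    n_gpu_layers=29,                         # partial offloading: 29 layers on the GPU
    n_ctx=4096,                              # modest context window to fit in memory
)

out = llm("What is 6 * 7? Think step by step.", max_tokens=256)
print(out["choices"][0]["text"])
```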

Performance:

A r/localllama user reported that they were able to get over 2 tok/sec with DeepSeek R1 671B, without using their GPU, on their local gaming setup.
Digital Spaceport wrote a full guide on how to run DeepSeek R1 671B fully locally on a $2000 EPYC server, on which you can get ~4.25 to 3.5 tokens per second.

As you can see, the tokens/s isn't quite bearable for any serious work, but it's fun to run these big models on accessible hardware.

What matters most to me is a combination of usefulness and time-to-usefulness in these models. Since reasoning models need to think before answering, their time-to-usefulness is usually higher than that of other models, but their usefulness is also usually higher.
We need to both maximize usefulness and minimize time-to-usefulness.
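
A quick back-of-the-envelope using the throughput figures above (the 1,000-token reasoning length is an assumed, illustrative number):

```python
# Rough time-to-usefulness estimate: at a few tokens per second, the
# chain-of-thought alone delays the final answer by several minutes.
reasoning_tokens = 1_000  # assumed length of the thinking section
for tokens_per_second in (2.0, 4.25):
    wait_minutes = reasoning_tokens / tokens_per_second / 60
    print(f"{tokens_per_second} tok/s -> ~{wait_minutes:.1f} min before the answer starts")
```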

70B via Ollama

70.6b params, 4-bit KM quantized DeepSeek-R1 running via Ollama:

GPU usage shoots up here, as expected when compared to the mostly CPU-powered run of 671B that I showcased above.
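
For completeness, a run like this can be reproduced against Ollama's local HTTP API (the model tag below is an assumption; use whatever `ollama list` reports on your machine):

```python
import json
import urllib.request

# Minimal sketch of querying a locally running Ollama server.
# Assumes the model was pulled beforehand, e.g. `ollama pull deepseek-r1:70b`.
payload = {
    "model": "deepseek-r1:70b",  # assumed tag for the 70B variant
    "prompt": "What is 6 * 7? Think step by step.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```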

Resources

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
[2402.03300] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
DeepSeek R1 - Notion (Building a fully local "deep researcher" with DeepSeek-R1 - YouTube).
DeepSeek R1's recipe to reproduce o1 and the future of reasoning LMs.
The Illustrated DeepSeek-R1 - by Jay Alammar.
Explainer: What's R1 & Everything Else? - Tim Kellogg.
DeepSeek R1 Explained to your grandma - YouTube

DeepSeek

- Try R1 at chat.deepseek.com.
- GitHub - deepseek-ai/DeepSeek-R1.
- deepseek-ai/Janus-Pro-7B · Hugging Face (January 2025): Janus-Pro is a novel autoregressive framework that unifies multimodal understanding and generation. It can both understand and generate images.
- DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning (January 2025) This paper introduces DeepSeek-R1, an open-source reasoning model that rivals the performance of OpenAI's o1. It provides a detailed methodology for training such models using large-scale reinforcement learning techniques.
- DeepSeek-V3 Technical Report (December 2024) This report discusses the implementation of an FP8 mixed-precision training framework validated on an extremely large-scale model, achieving both accelerated training and reduced GPU memory usage.
- DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (January 2024) This paper delves into scaling laws and presents findings that facilitate the scaling of large-scale models in open-source configurations. It introduces the DeepSeek LLM project, dedicated to advancing open-source language models with a long-term perspective.
- DeepSeek-Coder: When the Large Language Model Meets Programming - The Rise of Code Intelligence (January 2024) This research introduces the DeepSeek-Coder series, a range of open-source code models trained from scratch on 2 trillion tokens. The models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task to enhance code generation and infilling.
- DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (May 2024) This paper presents DeepSeek-V2, a Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
- DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence (June 2024) This research introduces DeepSeek-Coder-V2, an open-source Mixture-of-Experts (MoE) code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

Interesting events

- Hong Kong University replicates R1 results (Jan 25, '25).
- Huggingface announces huggingface/open-r1: Fully open reproduction of DeepSeek-R1 to replicate R1, fully open source (Jan 25, '25).
- OpenAI researcher confirms the DeepSeek team independently found and used some core ideas the OpenAI team used on the way to o1

Liked this post? Join the newsletter.