This is a storywriting and roleplay model with a significant amount of self-generated, long-context, multi-turn roleplay data.

I downloaded a bit under a thousand character cards from chub.ai and created a synthetic roleplay for each card. To maintain coherency over longer contexts, I batched as many turns as I could into 4k-token chunks. There was a lot of cleaning and validation between batches, so many examples were "lost," but the final output seems to be of very good quality. The longest conversation is about 20k tokens, and I plan to extend this further as well as broaden the dataset with more examples. The first 4k tokens of each conversation were generated with Command-R-Plus, with the remainder generated by byroneverson/Mistral-Small-Instruct-2409-abliterated.
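To make the batching concrete, here is a minimal sketch of how one 4k-token chunk of turns could be generated per card. This is an illustration, not the actual pipeline: the endpoint, model name, prompts, and the rough token heuristic are all assumptions.

```python
# Hypothetical chunked generation loop. Assumes an OpenAI-compatible
# endpoint is serving the generator; all names and prompts are invented.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

CHUNK_BUDGET = 4096  # batch turns in ~4k-token chunks for coherency


def rough_tokens(text: str) -> int:
    # Crude stand-in for the real tokenizer: roughly 4 characters per token.
    return len(text) // 4


def generate_chunk(card: str, messages: list[dict]) -> list[dict]:
    """Append turns until roughly one 4k-token chunk has been produced."""
    generated = 0
    while generated < CHUNK_BUDGET:
        resp = client.chat.completions.create(
            model="command-r-plus",  # placeholder; later chunks used Mistral-Small
            messages=[{"role": "system", "content": card}] + messages,
            max_tokens=512,
        )
        turn = resp.choices[0].message.content
        generated += rough_tokens(turn)
        messages.append({"role": "assistant", "content": turn})
        # Prompt the next turn; the real pipeline would also generate the
        # user side and clean/validate the batch before continuing.
        messages.append({"role": "user", "content": "Continue the scene."})
    return messages
```

Cleaning and validation would then run between calls to a loop like this, dropping conversations that fail before the next chunk is generated.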
Next, I downloaded the prompt backup from this site and used the prompts as seeds for some storywriting data:

https://aetherroom.club/whats-new#backup-update
I went over the seed prompts twice with Command-R-Plus: the first pass essentially wrote a rough draft of the output, and the second pass improved that draft and extended its length.
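As a rough illustration of the two passes (the prompts are invented, and the same OpenAI-compatible client pattern as in the sketch above is assumed):

```python
# Illustrative draft-then-refine pass; prompts and names are invented.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")


def two_pass(seed_prompt: str) -> str:
    # First pass: write a rough draft from the seed prompt.
    draft = client.chat.completions.create(
        model="command-r-plus",  # placeholder identifier
        messages=[{"role": "user",
                   "content": f"Write a story for this prompt:\n\n{seed_prompt}"}],
        max_tokens=2048,
    ).choices[0].message.content
    # Second pass: improve the draft and extend its length.
    return client.chat.completions.create(
        model="command-r-plus",
        messages=[{"role": "user",
                   "content": f"Improve this story and extend its length:\n\n{draft}"}],
        max_tokens=4096,
    ).choices[0].message.content
```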
Also included was a subset of each of the following datasets (a sketch of the mixing step follows the list):

- anthracite-org/stheno-filtered-v1.1
- anthracite-org/kalo_misc_part2
- anthracite-org/kalo_opus_misc_240827
- anthracite-org/kalo-opus-instruct-22k-no-refusal
- Chaser-cz/sonnet35-charcard-roleplay-sharegpt
- jondurbin/airoboros-3.2 (a very small subset)
- various other data, viewable at openerotica/mixed-rp
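If you want to reproduce a similar mix with Hugging Face `datasets`, something like the following sketch would work. The per-dataset counts are invented, since the actual subset sizes were not published, and it assumes the datasets have been normalized to a shared schema first.

```python
# Hypothetical subset-and-mix step; sample sizes are made up.
from datasets import concatenate_datasets, load_dataset

MIX = {
    "anthracite-org/stheno-filtered-v1.1": 5000,
    "anthracite-org/kalo_misc_part2": 2000,
    "anthracite-org/kalo_opus_misc_240827": 2000,
    "anthracite-org/kalo-opus-instruct-22k-no-refusal": 5000,
    "Chaser-cz/sonnet35-charcard-roleplay-sharegpt": 3000,
    "jondurbin/airoboros-3.2": 500,  # "a very small subset"
}

parts = []
for repo, n in MIX.items():
    ds = load_dataset(repo, split="train").shuffle(seed=42)
    parts.append(ds.select(range(min(n, len(ds)))))

# concatenate_datasets requires matching features across all parts.
mixed = concatenate_datasets(parts)
```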
Every line of data was run through a large model to filter out low-quality, repetitive, and underage content.
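A minimal sketch of such a filter, assuming an OpenAI-compatible endpoint and an invented judge prompt (the actual filtering model and criteria prompt were not published):

```python
# Hypothetical LLM quality gate: keep only examples the judge passes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

JUDGE_PROMPT = (
    "Answer PASS or FAIL only. FAIL if the text is low quality, highly "
    "repetitive, or involves underage content.\n\n{text}"
)


def keep(text: str) -> bool:
    verdict = client.chat.completions.create(
        model="quality-judge",  # placeholder for the large filtering model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        max_tokens=4,
    ).choices[0].message.content
    return verdict.strip().upper().startswith("PASS")


examples = ["...some roleplay line..."]  # stand-in for the real data
clean = [ex for ex in examples if keep(ex)]
```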