This is a storywriting and roleplay model with a significant amount of self-generated, long-context, multi-turn roleplay data.

I downloaded a bit under a thousand character cards from chub.ai and created a synthetic roleplay for each card. To maintain coherency over longer contexts, I batched as many turns as I could into 4k-token chunks. There was a lot of cleaning and validation between batches, so many examples were "lost," but the final output seems to be of very good quality. The longest conversation is about 20k tokens, and I plan to extend this further as well as broaden the dataset with more examples. The first 4k tokens of each conversation were generated with Command-R-Plus, with the remainder generated by byroneverson/Mistral-Small-Instruct-2409-abliterated.
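To make the batching concrete, here is a minimal sketch of how one 4k-token chunk of turns could be generated per card. This is an illustration, not the actual pipeline: the endpoint, model name, prompts, and the rough token heuristic are all assumptions.

```python
# Hypothetical chunked generation loop. Assumes an OpenAI-compatible
# endpoint is serving the generator; all names and prompts are invented.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

CHUNK_BUDGET = 4096  # batch turns in ~4k-token chunks for coherency


def rough_tokens(text: str) -> int:
    # Crude stand-in for the real tokenizer: roughly 4 characters per token.
    return len(text) // 4


def generate_chunk(card: str, messages: list[dict]) -> list[dict]:
    """Append turns until roughly one 4k-token chunk has been produced."""
    generated = 0
    while generated < CHUNK_BUDGET:
        resp = client.chat.completions.create(
            model="command-r-plus",  # placeholder; later chunks used Mistral-Small
            messages=[{"role": "system", "content": card}] + messages,
            max_tokens=512,
        )
        turn = resp.choices[0].message.content
        generated += rough_tokens(turn)
        messages.append({"role": "assistant", "content": turn})
        # Prompt the next turn; the real pipeline would also generate the
        # user side and clean/validate the batch before continuing.
        messages.append({"role": "user", "content": "Continue the scene."})
    return messages
```

Cleaning and validation would then run between calls to a loop like this, dropping conversations that fail before the next chunk is generated.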
Next, I downloaded the prompt backup from this site and used the prompts as seeds for some storywriting data:

https://aetherroom.club/whats-new#backup-update
I went over the seed prompts twice with Command-R-Plus: the first pass essentially wrote a rough draft of the output, and the second pass improved that draft and extended its length.
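As a rough illustration of the two passes (the prompts are invented, and the same OpenAI-compatible client pattern as in the sketch above is assumed):

```python
# Illustrative draft-then-refine pass; prompts and names are invented.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")


def two_pass(seed_prompt: str) -> str:
    # First pass: write a rough draft from the seed prompt.
    draft = client.chat.completions.create(
        model="command-r-plus",  # placeholder identifier
        messages=[{"role": "user",
                   "content": f"Write a story for this prompt:\n\n{seed_prompt}"}],
        max_tokens=2048,
    ).choices[0].message.content
    # Second pass: improve the draft and extend its length.
    return client.chat.completions.create(
        model="command-r-plus",
        messages=[{"role": "user",
                   "content": f"Improve this story and extend its length:\n\n{draft}"}],
        max_tokens=4096,
    ).choices[0].message.content
```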
Also included was a subset of each of the following datasets (a sketch of the mixing step follows the list):

- anthracite-org/stheno-filtered-v1.1
- anthracite-org/kalo_misc_part2
- anthracite-org/kalo_opus_misc_240827
- anthracite-org/kalo-opus-instruct-22k-no-refusal
- Chaser-cz/sonnet35-charcard-roleplay-sharegpt
- jondurbin/airoboros-3.2 (a very small subset)
- various other data, viewable at openerotica/mixed-rp
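If you want to reproduce a similar mix with Hugging Face `datasets`, something like the following sketch would work. The per-dataset counts are invented, since the actual subset sizes were not published, and it assumes the datasets have been normalized to a shared schema first.

```python
# Hypothetical subset-and-mix step; sample sizes are made up.
from datasets import concatenate_datasets, load_dataset

MIX = {
    "anthracite-org/stheno-filtered-v1.1": 5000,
    "anthracite-org/kalo_misc_part2": 2000,
    "anthracite-org/kalo_opus_misc_240827": 2000,
    "anthracite-org/kalo-opus-instruct-22k-no-refusal": 5000,
    "Chaser-cz/sonnet35-charcard-roleplay-sharegpt": 3000,
    "jondurbin/airoboros-3.2": 500,  # "a very small subset"
}

parts = []
for repo, n in MIX.items():
    ds = load_dataset(repo, split="train").shuffle(seed=42)
    parts.append(ds.select(range(min(n, len(ds)))))

# concatenate_datasets requires matching features across all parts.
mixed = concatenate_datasets(parts)
```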
Every line of data was run through a large model to filter out low-quality, repetitive, and underage content.
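A minimal sketch of such a filter, assuming an OpenAI-compatible endpoint and an invented judge prompt (the actual filtering model and criteria prompt were not published):

```python
# Hypothetical LLM quality gate: keep only examples the judge passes.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

JUDGE_PROMPT = (
    "Answer PASS or FAIL only. FAIL if the text is low quality, highly "
    "repetitive, or involves underage content.\n\n{text}"
)


def keep(text: str) -> bool:
    verdict = client.chat.completions.create(
        model="quality-judge",  # placeholder for the large filtering model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(text=text)}],
        max_tokens=4,
    ).choices[0].message.content
    return verdict.strip().upper().startswith("PASS")


examples = ["...some roleplay line..."]  # stand-in for the real data
clean = [ex for ex in examples if keep(ex)]
```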