2020-04-23 14:40:04 (GMT)
We should make a list of options and collect data.
Some things come to mind: what timeout should we choose for the experiments, and what timeout should we build the scheduler for (the time limit is afaik 20min, but the 2017 scheduler uses 33.3min for UF).
2020-04-23 14:40:20 (GMT)
@Daniel El Ouraoui what options does your branch add?
2020-04-23 14:56:17 (GMT)
The timeout is indeed 20min
2020-04-30 13:29:38 (GMT)
Hans-Jörg said:
Daniel El Ouraoui what options does your branch add?
Here are the options:
--mdli activates the selection;
--limit-sk-qform=<value> limits the maximum number of times each existential formula is skolemized; 9 is the best value;
--limit-qform=<value> limits the number of instances per quantified formula; 100 is the best.
It's also useful to play a bit with the limit of instances per sort, --inst-sorts-threshold=<value>; I found that 100 was a good value.
I think it might be useful to use several configurations in the portfolio script. Something like this:
./veriT --disable-print-success --mdli --limit-sk-qform=9 --limit-qform=100 --inst-sorts-threshold=1000 file.smt2
./veriT --disable-print-success --mdli --limit-sk-qform=9 --limit-qform=100 --inst-sorts-threshold=100 file.smt2
./veriT --disable-print-success --triggers-new --triggers-m-cond --mdli --limit-sk-qform=9 --limit-qform=1000 --inst-sorts-threshold=100 file.smt2
./veriT --disable-print-success --triggers-new --triggers-m-cond --mdli --limit-sk-qform=9 --limit-qform=100 --inst-sorts-threshold=100 file.smt2
That could help. It is also possible to tune the value of --limit-sk-qform within [7; 13] and --limit-qform within [40; 150], in steps of 10.
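If it helps, such a sweep could look roughly like this (just a sketch; file.smt2 and keeping the other options fixed at the values above are assumptions):
# hypothetical parameter sweep over the two limits
for sk in $(seq 7 13); do
  for qf in $(seq 40 10 150); do
    ./veriT --disable-print-success --mdli --limit-sk-qform=$sk --limit-qform=$qf --inst-sorts-threshold=100 file.smt2
  done
done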
2020-04-30 13:56:39 (GMT)
Maybe we should just collect data with those 4 configurations and see how/if they fit in?
2020-04-30 16:44:31 (GMT)
Yes, maybe. I'll try to tune a bit. How much time do I have to give you a good combination? Is tomorrow ok?
2020-04-30 17:00:10 (GMT)
yeah, the deadline for the first version is the 4th, but I will just use the old scheduling script. The final version is due on the 18th. I think the experiments for the old strategies will take 2-3 more days anyway.
2020-05-02 12:55:44 (GMT)
Ok, let me work a bit on that. In the meantime, you can take the latest version on the inst-selection branch if you want. I updated the code to remove some experimental stuff. And I realized that increasing --inst-sorts-threshold is not good for my optimization.
I think the best is to try this configuration:
./veriT --disable-print-success --mdli --limit-sk-qform=9 --limit-qform=100 --inst-sorts-threshold=100 file.smt2
With and without triggers-new, single and multi triggers.
I'll find out some other values for --limit-qform.
2020-05-04 08:19:26 (GMT)
I did some experiments this weekend; it seems that using the optimization as a base for all other configurations pays off. I got better results by using --mdli --limit-sk-qform=9 --limit-qform=100 --inst-sorts-threshold=100 by default with all the different configurations.
Adding a bash variable to the script like this:
restriction_sk_inst="--mdli --limit-sk-qform=9 --limit-qform=100 --inst-sorts-threshold=100"
seems to be the easiest solution.
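For example, the strategy lines in the script could then just reuse it (a sketch; the exact call sites are an assumption):
# $restriction_sk_inst is left unquoted on purpose so it expands into separate options
./veriT --disable-print-success $restriction_sk_inst file.smt2
./veriT --disable-print-success --triggers-new --triggers-m-cond $restriction_sk_inst file.smt2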
2020-05-04 09:49:09 (GMT)
Cool. One thing though: I now ran experiments with the existing (42) strategies and it took quite a while (~4 days). I'm not sure how to integrate this. I see three options: 1) repeat the experiment with your options added and select from all the options; 2) run experiments with your options added to some strategies and train on that; 3) select from the existing experimental data and then add your options.
2020-05-04 09:54:05 (GMT)
Anyway, I put the results of the experiments so far into sto-verit/Experiment-Results/SMT-COMP-2020-1. There are no soundness bugs, but there are some other bugs which we could fix if we want.
2020-05-04 19:38:13 (GMT)
Cool, thank you. I think 2) would be the best option, but maybe the toughest... 3) is the simplest and 1) a trade-off... If we could try 1) we would have a good overview of the gain from these options and, above all, whether it is worth it...
I couldn't see sto-verit/Experiment-Results/SMT-COMP-2020-1. Is it in the git repo? On devel?
2020-05-06 14:02:21 (GMT)
I just updated the strategies_under_test file on GitLab.
2020-05-08 14:46:34 (GMT)
Oh nice! Somehow this ended up in its own branch, I don't know why; I merged it now. One note: the first empty line was there on purpose: it represents calling veriT without any option. I also implemented a simple script to optimize the order of strategies to minimize the total CPU time. I'm now testing the schedule it spit out. Afterwards I'll collect data for your new options.
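Roughly, the idea is something like this (just a sketch, not the actual script; the strategies.txt format here is made up: one single-space-separated line per strategy, "<options> <time_slice_s> <num_solved>"):
# greedy heuristic: run cheap, high-yield strategies first (smallest time/solved ratio)
awk '$NF > 0 { printf "%f %s\n", $(NF-1) / $NF, $0 }' strategies.txt | sort -n | sed 's/^[^ ]* //'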
2020-05-10 16:02:17 (GMT)
I uploaded the experiment results for the new options.
2020-05-10 22:20:37 (GMT)
Yes, I updated it using the web interface, maybe that is why it got messed up; sorry, I didn't know about the empty line. I checked the results, thank you. They are in SMT-COMP-2020-3? It seems that this configuration:
--mdli --limit-sk-qform=9 --limit-qform=1000 --inst-sorts-threshold=100 performs well, I'm surprised.
2020-05-10 22:27:49 (GMT)
Might be interesting to add this one, it could be even better:
--mdli --limit-sk-qform=15 --limit-qform=1000 --inst-sorts-threshold=100
2020-05-11 08:11:11 (GMT)
Ok, I'll try this too, and any other strategy you suggest until the afternoon, but then I think we should start preparing the full scheduler and the system description.
2020-05-11 09:38:20 (GMT)
The new scheduler has this option (--limit-sk-qform=9) as the first strategy, and the prediction says it solves 200 more than the last strategy (on UF).
2020-05-11 15:02:14 (GMT)
Hm, I ran a test on gros, but sadly we solved less with the new script; I suspect it's because of a higher number of out-of-memory errors. Gros has much less memory than grvingt.
2020-05-11 15:16:58 (GMT)
look at this: https://schurr.io/smtcomp/
2020-05-11 15:27:42 (GMT)
Arf, yes, that's strange, because on grvingt, just with UF and a timeout of 180s, it solves more problems...
I added the results here: https://members.loria.fr/DElOuraoui/veriT_port_UF_results.html
2020-05-11 15:31:18 (GMT)
Is gros more realistic than grvingt, SMT-COMP-wise?
2020-05-11 15:32:17 (GMT)
No, it's less realistic: at SMT-COMP we have 60GB per benchmark.
2020-05-11 15:33:25 (GMT)
Could you share the new scheduler?
2020-05-11 15:34:13 (GMT)
I'll try to check, and maybe adjust it, with grvingt and a smaller timeout.
2020-05-11 15:40:22 (GMT)
I see, we cannot do it with a shorter timeout because this is what the optimizer computed?
2020-05-11 15:43:32 (GMT)
Yeah, but the optimizer then tries to minimize the total time to solve all benchmarks, so testing with smaller timeouts might still be valuable. I was just interested in doing one test under competition conditions. The prediction is 3433 solved benchmarks, the reality is 3046. The prediction is calculated by summing how many benchmarks have been solved in total by the picked option/time-slice pairs in the tests.
2020-05-11 15:46:50 (GMT)
Arf, maybe this should be run again with grvingt, to be sure... I also think it's a memory issue; gros is much less powerful.
2020-05-11 15:48:12 (GMT)
Yeah probably, the problem is that tests with a timeout of 20min take a long time (~6h)
2020-05-11 15:48:14 (GMT)
I'll try to divide all timeouts by 10 and run it again under 120 seconds.
2020-05-11 17:52:40 (GMT)
Maybe the scheduler doesn't hold up in practice.
Here are the results on grvingt, with a 120s timeout: https://members.loria.fr/DElOuraoui/veriT_NewSched_UF.html
2020-05-11 19:28:47 (GMT)
Yeah, I'm generally confused now. Tomorrow I want to check what happens to the examples where the optimizer predicts they will be solved, but then they are not solved. I'm not sure, however, how much we can learn by dividing the timeouts by 10, since the behavior is rather nonlinear.
2020-05-11 19:29:14 (GMT)
Worst case, we can still use the old scheduler, or some combination.
2020-05-11 19:33:11 (GMT)
Yes, I think I have a clue about what's going on... I'm just waiting for the results of the newly run experiments, maybe I'll get the answer... :fingers_crossed:
2020-05-12 08:30:02 (GMT)
veriT was still printing statistics info, but the scheduler was not designed to read that kind of output... That was the issue.
I just got the results; they are a bit better, but still less than what the optimizer predicts... On UF with a 20min timeout it solves 3359 problems.
2020-05-12 08:37:19 (GMT)
I disabled STAT in veriT and, to be sure, replaced the case statement with a grep in the trywith function:
res1="$(grep "unsat" <<< "$result")"
if [[ $res1 == "unsat" ]]; then echo "$result"; exit 0; fi
2020-05-12 09:28:00 (GMT)
Here is the raw results file: https://members.loria.fr/DElOuraoui/SMTCOMP-1200_new_scheduler_raw.txt
2020-05-13 15:28:35 (GMT)
I found a bug in the optimization script which probably resulted in wrong strategies :(
2020-05-13 19:44:08 (GMT)
Oooh, does that mean there is hope for a better strategy?
2020-05-14 09:08:25 (GMT)
Not much, but I have trained a complete scheduler now; it's in the verit-smtcomp repo. I will also make one for 24s. I noticed that for some of the smaller theories there are single options that solve all examples. I still need to implement logic to detect this.
2020-05-15 14:29:06 (GMT)
I fixed a couple of bugs and did a lot of refactoring. Now there is a retrained scheduler and a 24s scheduler in verit-smtcomp.
2020-05-15 15:29:28 (GMT)
Oh, another thing: the prediction was wrong, I used the wrong counting function ^^ It should be the number of benchmarks that are solved in total (i.e., the union of the benchmarks solved by the different options).
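To illustrate the difference (assuming hypothetical per-strategy files solved_by_*.txt that list the solved benchmark names, one per line):
cat solved_by_*.txt | wc -l            # plain sum: counts a benchmark once per option that solves it
cat solved_by_*.txt | sort -u | wc -l  # union: each benchmark counted once, which is what the prediction should report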
2020-05-16 09:54:21 (GMT)
Do you think it is necessary to launch new experiments? Where should I add the system description?
2020-05-17 16:26:50 (GMT)
I played around with a scheduler for 24s (it's not really worth it), but one thing: we lose examples with the new grep-based function (in the 24s scheduler), so I will go back to the old one. I hope veriT never prints out crap.
2020-05-17 17:00:53 (GMT)
Just be careful: StarExec collapses stdout and stderr, and this year it's the solver's job to capture stderr (in case it would throw off the postprocessors).
2020-05-17 19:21:00 (GMT)
So stderr should be eliminated?
2020-05-17 19:24:51 (GMT)
Not necessarily. It's just that the postprocessors expect well-behaved output, as described in the rules.
2020-05-17 19:25:23 (GMT)
So if, for example, stderr prints "unknown", that might be captured by the postprocessor.
2020-05-17 19:26:05 (GMT)
That was an issue in the competition last year, so the organizers wrapped all solvers to kill stderr. This year we simplified this by leaving it to the solvers themselves.
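A wrapper could simply discard stderr, something like this (just a sketch; the actual competition script may handle it differently):
# hypothetical wrapper: forward the arguments and drop whatever veriT writes to stderr,
# so StarExec only sees the well-behaved stdout
./veriT --disable-print-success "$@" 2>/dev/null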
2020-05-18 08:26:29 (GMT)
The cvc4/veriT script actually merges them too.
2020-05-18 09:44:40 (GMT)
@Daniel El Ouraoui system descriptions are in verit-extra
2020-05-19 10:56:16 (GMT)
@Daniel El Ouraoui both schedulers now perform better than predicted: the 24s scheduler solves 3279 (the prediction was 3266), and the 1200s scheduler solves 3421 with a prediction of 3410.